Robert Tibshirani
Professor of Biomedical Data Science and of Statistics
Department of Biomedical Data Science
Bio
Robert Tibshirani's main interests are in applied statistics, biostatistics, and data mining. He is co-author of the books "Generalized Additive Models" (with Trevor Hastie, Stanford), "An Introduction to the Bootstrap" (with Brad Efron, Stanford), and "The Elements of Statistical Learning" (with Trevor Hastie and Jerry Friedman, Stanford). His current research focuses on problems in biology and genomics, medicine, and industry. With Stanford collaborator Balasubramanian Narasimhan, he also develops software packages for genomics and proteomics.
Academic Appointments
- Professor, Department of Biomedical Data Science
- Professor, Statistics
- Member, Bio-X
- Member, Stanford Cancer Institute
Administrative Appointments
- Professor, Department of Biomedical Data Science and Department of Statistics, Stanford University (2015 - Present)
- Professor, Department of Health Research and Policy and Department of Statistics, Stanford University (1998 - 2015)
- Professor, Department of Public Health Sciences and Department of Statistics, University of Toronto (1994 - 1998)
- Associate Professor, Department of Statistics, University of Toronto (1989 - 1994)
- Associate Professor, Department of Preventive Medicine and Biostatistics, University of Toronto (1989 - 1994)
- Assistant Professor, Department of Statistics, University of Toronto (1985 - 1989)
- Assistant Professor, Department of Preventive Medicine and Biostatistics, University of Toronto (1985 - 1989)
Honors & Awards
- Doctor Honoris Causa, University of Waterloo (2018)
- Elected Member, National Academy of Sciences (2012)
- Gold Medal, Statistical Society of Canada (2012)
- Alumni Achievement Award, University of Waterloo (2006)
- Fellow, Royal Society of Canada (2001)
- CRM-SSC Prize in Statistics, Statistical Society of Canada (2000)
- E.W. Steacie Memorial Fellowship, Natural Sciences and Engineering Research Council of Canada (1997)
- President's Award, Committee of Presidents of Statistical Societies (1996)
- Guggenheim Fellowship, J. Guggenheim Foundation (1994)
- Fellow, Institute of Mathematical Statistics (1993)
- Fellow, American Statistical Association (1992)
Boards, Advisory Committees, Professional Organizations
- Associate Editor, Annals of Applied Statistics (2006 - Present)
- Associate Editor, PLoS Biology (2001 - 2004)
- Member, Screening Panel, National Science Foundation (1999 - 1999)
- Associate Editor, Annals of Statistics (1998 - Present)
- Associate Editor, Statistical Science (1995 - Present)
- Chair, Committee on Computerization, Institute of Mathematical Statistics (1995 - Present)
- Associate Editor, Canadian Journal of Statistics (1995 - 1997)
- Program Chair, Statistical Computing, American Statistical Association (1995 - 1996)
- Annual Meeting Program Chair, Statistical Society of Canada (1994 - 1994)
- Series Editor, Computing and Graphics Monographs, Chapman & Hall (1994 - 1994)
- Council Member, Institute of Mathematical Statistics (1991 - 1994)
- Member, Statistical Sciences Grant Selection Committee, Natural Sciences and Engineering Research Council of Canada (1989 - 1993)
- Associate Editor, Canadian Journal of Statistics (1988 - 1991)
- Associate Editor, Theory and Methods, Journal of the American Statistical Association (1986 - 1995)
Professional Education
- B.Math., University of Waterloo, Statistics and Computer Science (1979)
- M.Sc., University of Toronto, Statistics (1980)
- Ph.D., Stanford University, Statistics (1984)
Current Research and Scholarly Interests
My research is in applied statistics and biostatistics. I specialize in computer-intensive methods for regression and classification, bootstrap, cross-validation and statistical inference, and signal and image analysis for medical diagnosis.
2024-25 Courses
- Literature of Statistics
STATS 319 (Win)
- Statistical Learning and Data Science [Flipped]
STATS 202F (Win)
Independent Studies (10)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIODS 299 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Graduate Research
IMMUNOL 399 (Aut, Win, Spr, Sum)
- Independent Study
STATS 199 (Aut, Win, Spr, Sum)
- Independent Study
STATS 299 (Aut, Win, Spr, Sum)
- Industrial Research for Statisticians
STATS 398 (Aut, Win, Spr, Sum)
- Medical Scholars Research
BIOMEDIN 370 (Aut, Win, Spr, Sum)
- Medical Scholars Research
HRP 370 (Aut, Win, Spr, Sum)
- Research
STATS 399 (Aut, Win, Spr, Sum)
Prior Year Courses
2023-24 Courses
- Literature of Statistics
STATS 319 (Win)
2022-23 Courses
- Biomedical Informatics Student Seminar
BIODS 201, BIOMEDIN 201 (Aut)
- Modern Applied Statistics: Learning
STATS 315A (Win)
2021-22 Courses
- Modern Applied Statistics: Learning
STATS 315A (Win)
- Workshop in Biostatistics
BIODS 260A, STATS 260A (Aut)
Stanford Advisees
- Doctoral Dissertation Reader (AC)
Meelad Amouzgar, Camilo Espinosa Bernal, Kevin Fry, Sophia Lu, Aaron Mishkin, Nick Phillips
- Doctoral Dissertation Advisor (AC)
Erin Craig, Daisy Ding, Max Schuessler, Min Sun
- Orals Evaluator
James Yang
- Doctoral Dissertation Co-Advisor (AC)
Ivy Zhang
- Doctoral (Program)
Yixing Jiang, Gowri Nayar, Betty Xiong
All Publications
-
Glaucoma classification through SSVEP derived ON- and OFF-pathway features.
medRxiv : the preprint server for health sciences
2024
Abstract
Recent evidence from small animal models and human electrophysiology suggests that the OFF-pathway is more vulnerable to glaucomatous insult than the ON-pathway. Thus, OFF-pathway based measurements of visual function may be useful in the diagnosis of Glaucoma. The steady-state visually evoked potential (SSVEP) can be used to non-invasively make such functional measurements. Here, we examine whether OFF- and ON-pathway biasing SSVEP measurements differently predict glaucoma diagnosis using a large cohort of 98 glaucoma patients and 71 controls. Using both a logistic regression with k-fold cross-validation and a random forest classifier, we show that OFF-pathway biasing features produce a small improvement in predictive accuracy over ON-pathway biasing features. However, despite our inclusion of many more response features and the retention of both participants' eyes, our classifier did not perform as well as previous reports that used the isolated-check VEP. This is likely a result of the relatively small amount of data we collected for each participant, but may also be explained by the absence of any train-test splitting in preexisting work. Nevertheless, our results support further exploration of the diagnostic potential of OFF-pathway biasing functional biomarkers for glaucoma.
View details for DOI 10.1101/2024.08.22.24312443
View details for PubMedID 39228700
View details for PubMedCentralID PMC11370506
-
Multimodal data fusion using sparse canonical correlation analysis and cooperative learning: a COVID-19 cohort study.
NPJ digital medicine
2024; 7 (1): 117
Abstract
Through technological innovations, patient cohorts can be examined from multiple views with high-dimensional, multiscale biomedical data to classify clinical phenotypes and predict outcomes. Here, we aim to present our approach for analyzing multimodal data using unsupervised and supervised sparse linear methods in a COVID-19 patient cohort. This prospective cohort study of 149 adult patients was conducted in a tertiary care academic center. First, we used sparse canonical correlation analysis (CCA) to identify and quantify relationships across different data modalities, including viral genome sequencing, imaging, clinical data, and laboratory results. Then, we used cooperative learning to predict the clinical outcome of COVID-19 patients: Intensive care unit admission. We show that serum biomarkers representing severe disease and acute phase response correlate with original and wavelet radiomics features in the LLL frequency channel (cor(Xu1, Zv1) = 0.596, p value < 0.001). Among radiomics features, histogram-based first-order features reporting the skewness, kurtosis, and uniformity have the lowest negative, whereas entropy-related features have the highest positive coefficients. Moreover, unsupervised analysis of clinical data and laboratory results gives insights into distinct clinical phenotypes. Leveraging the availability of global viral genome databases, we demonstrate that the Word2Vec natural language processing model can be used for viral genome encoding. It not only separates major SARS-CoV-2 variants but also allows the preservation of phylogenetic relationships among them. Our quadruple model using Word2Vec encoding achieves better prediction results in the supervised task. The model yields area under the curve (AUC) and accuracy values of 0.87 and 0.77, respectively. 
Our study illustrates that sparse CCA analysis and cooperative learning are powerful techniques for handling high-dimensional, multimodal data to investigate multivariate associations in unsupervised and supervised tasks.
View details for DOI 10.1038/s41746-024-01128-2
View details for PubMedID 38714751
View details for PubMedCentralID PMC11076490
-
CAR19 monitoring by peripheral blood immunophenotyping reveals histology-specific expansion and toxicity.
Blood advances
2024
Abstract
Chimeric antigen receptor (CAR) T cells directed against CD19 (CAR19) are a revolutionary treatment for B-cell lymphomas. CAR19 cell expansion is necessary for CAR19 function but is also associated with toxicity. To define the impact of CAR19 expansion on patient outcomes, we prospectively followed a cohort of 236 patients treated with CAR19 (brexucabtagene autoleucel or axicabtagene ciloleucel) for mantle cell (MCL), follicular (FL), and large B-cell lymphoma (LBCL) over the course of five years and obtained CAR19 expansion data using peripheral blood immunophenotyping for 188 of these patients. CAR19 expansion was higher in patients with MCL compared to other lymphoma histologic subtypes. Notably, patients with MCL had increased toxicity and required four-fold higher cumulative steroid doses than patients with LBCL. CAR19 expansion was associated with the development of cytokine release syndrome (CRS), immune effector cell associated neurotoxicity syndrome (ICANS), and the requirement for granulocyte colony stimulating factor (GCSF) after day 14 post-infusion. Younger patients and those with elevated lactate dehydrogenase (LDH) had significantly higher CAR19 expansion. In general, no association between CAR19 expansion and LBCL treatment response was observed. However, when controlling for tumor burden, we found that lower CAR19 expansion in conjunction with low LDH was associated with improved outcomes in LBCL. In sum, this study finds CAR19 expansion principally associates with CAR-related toxicity. Additionally, CAR19 expansion as measured by peripheral blood immunophenotyping may be dispensable to favorable outcomes in LBCL.
View details for DOI 10.1182/bloodadvances.2024012637
View details for PubMedID 38498731
-
Intraoperative Evaluation of Breast Tissues During Breast Cancer Operations Using the MasSpec Pen.
JAMA network open
2024; 7 (3): e242684
Abstract
Importance: Surgery with complete tumor resection remains the main treatment option for patients with breast cancer. Yet, current technologies are limited in providing accurate assessment of breast tissue in vivo, warranting development of new technologies for surgical guidance.
Objective: To evaluate the performance of the MasSpec Pen for accurate intraoperative assessment of breast tissues and surgical margins based on metabolic and lipid information.
Design, Setting, and Participants: In this diagnostic study conducted between February 23, 2017, and August 19, 2021, the mass spectrometry-based device was used to analyze healthy breast and invasive ductal carcinoma (IDC) banked tissue samples from adult patients undergoing breast surgery for ductal carcinomas or nonmalignant conditions. Fresh-frozen tissue samples and touch imprints were analyzed in a laboratory. Intraoperative in vivo and ex vivo breast tissue analyses were performed by surgical staff in operating rooms (ORs) within 2 different hospitals at the Texas Medical Center. Molecular data were used to build statistical classifiers.
Main Outcomes and Measures: Prediction results of tissue analyses from classification models were compared with gross assessment, frozen section analysis, and/or final postoperative pathology to assess accuracy.
Results: All data acquired from the 143 banked tissue samples, including 79 healthy breast and 64 IDC tissues, were included in the statistical analysis. Data presented rich molecular profiles of healthy and IDC banked tissue samples, with significant changes in relative abundances observed for several metabolic species. Statistical classifiers yielded accuracies of 95.6%, 95.5%, and 90.6% for training, validation, and independent test sets, respectively. A total of 25 participants enrolled in the clinical, intraoperative study; all were female, and the median age was 58 years (IQR, 44-66 years). Intraoperative testing of the technology was successfully performed by surgical staff during 25 breast operations. Of 273 intraoperative analyses performed during 25 surgical cases, 147 analyses from 22 cases were subjected to statistical classification. Testing of the classifiers on 147 intraoperative mass spectra yielded 95.9% agreement with postoperative pathology results.
Conclusions and Relevance: The findings of this diagnostic study suggest that the mass spectrometry-based system could be clinically valuable to surgeons and patients by enabling fast molecular-based intraoperative assessment of in vivo and ex vivo breast tissue samples and surgical margins.
View details for DOI 10.1001/jamanetworkopen.2024.2684
View details for PubMedID 38517441
-
Smooth multi-period forecasting with application to prediction of COVID-19 cases.
Journal of computational and graphical statistics : a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America
2024; 33 (3): 955-967
Abstract
Forecasting methodologies have always attracted a lot of attention and have become an especially hot topic since the beginning of the COVID-19 pandemic. In this paper we consider the problem of multi-period forecasting that aims to predict several horizons at once. We propose a novel approach that forces the prediction to be "smooth" across horizons and apply it to two tasks: point estimation via regression and interval prediction via quantile regression. This methodology was developed for real-time distributed COVID-19 forecasting. We illustrate the proposed technique with the CovidCast dataset as well as a small simulation example.
View details for DOI 10.1080/10618600.2023.2285337
View details for PubMedID 39430215
View details for PubMedCentralID PMC11488784
View details for Web of Science ID 001138480600001
-
Semi-supervised Cooperative Learning for Multiomics Data Fusion
SPRINGER INTERNATIONAL PUBLISHING AG. 2024: 54-63
View details for DOI 10.1007/978-3-031-47679-2_5
View details for Web of Science ID 001148056600005
-
Evaluating a shrinkage estimator for the treatment effect in clinical trials.
Statistics in medicine
2023
Abstract
The main objective of most clinical trials is to estimate the effect of some treatment compared to a control condition. We define the signal-to-noise ratio (SNR) as the ratio of the true treatment effect to the SE of its estimate. In a previous publication in this journal, we estimated the distribution of the SNR among the clinical trials in the Cochrane Database of Systematic Reviews (CDSR). We found that the SNR is often low, which implies that the power against the true effect is also low in many trials. Here we use the fact that the CDSR is a collection of meta-analyses to quantitatively assess the consequences. Among trials that have reached statistical significance we find considerable overoptimism of the usual unbiased estimator and under-coverage of the associated confidence interval. Previously, we have proposed a novel shrinkage estimator to address this "winner's curse." We compare the performance of our shrinkage estimator to the usual unbiased estimator in terms of the root mean squared error, the coverage and the bias of the magnitude. We find superior performance of the shrinkage estimator both conditionally and unconditionally on statistical significance.
View details for DOI 10.1002/sim.9992
View details for PubMedID 38111969
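The link the abstract above draws between a low SNR and low power can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper: it assumes the effect estimate is normally distributed with known SE, so that the z-statistic is N(SNR, 1), and it assumes a two-sided test at level alpha. The function name `power_from_snr` is hypothetical.

```python
from statistics import NormalDist


def power_from_snr(snr: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test when the z-statistic is N(snr, 1).

    snr is the true effect divided by the SE of its estimate.
    Assumes a normal estimate with known SE (an illustrative assumption).
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    # Probability that |Z| exceeds the critical value, Z ~ N(snr, 1)
    return z.cdf(-crit - snr) + z.cdf(-crit + snr)


if __name__ == "__main__":
    for snr in (0.0, 1.0, 2.0, 2.8):
        print(f"SNR = {snr:.1f} -> power = {power_from_snr(snr):.3f}")
```

Under these assumptions an SNR of 0 gives power equal to alpha, and power rises with the SNR, which is why a distribution of SNRs concentrated near small values implies many underpowered trials.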
-
Multimodal Biomedical Data Fusion Using Sparse Canonical Correlation Analysis and Cooperative Learning: A Cohort Study on COVID-19.
Research square
2023
Abstract
Through technological innovations, patient cohorts can be examined from multiple views with high-dimensional, multiscale biomedical data to classify clinical phenotypes and predict outcomes. Here, we aim to present our approach for analyzing multimodal data using unsupervised and supervised sparse linear methods in a COVID-19 patient cohort. This prospective cohort study of 149 adult patients was conducted in a tertiary care academic center. First, we used sparse canonical correlation analysis (CCA) to identify and quantify relationships across different data modalities, including viral genome sequencing, imaging, clinical data, and laboratory results. Then, we used cooperative learning to predict the clinical outcome of COVID-19 patients. We show that serum biomarkers representing severe disease and acute phase response correlate with original and wavelet radiomics features in the LLL frequency channel (corr(Xu1, Zv1) = 0.596, p-value < 0.001). Among radiomics features, histogram-based first-order features reporting the skewness, kurtosis, and uniformity have the lowest negative, whereas entropy-related features have the highest positive coefficients. Moreover, unsupervised analysis of clinical data and laboratory results gives insights into distinct clinical phenotypes. Leveraging the availability of global viral genome databases, we demonstrate that the Word2Vec natural language processing model can be used for viral genome encoding. It not only separates major SARS-CoV-2 variants but also allows the preservation of phylogenetic relationships among them. Our quadruple model using Word2Vec encoding achieves better prediction results in the supervised task. The model yields area under the curve (AUC) and accuracy values of 0.87 and 0.77, respectively. 
Our study illustrates that sparse CCA analysis and cooperative learning are powerful techniques for handling high-dimensional, multimodal data to investigate multivariate associations in unsupervised and supervised tasks.
View details for DOI 10.21203/rs.3.rs-3569833/v1
View details for PubMedID 38045288
-
Public health factors help explain cross country heterogeneity in excess death during the COVID19 pandemic.
Scientific reports
2023; 13 (1): 16196
Abstract
The COVID-19 pandemic has taken a devastating toll around the world. Since January 2020, the World Health Organization estimates 14.9 million excess deaths have occurred globally. Despite this grim number quantifying the deadly impact, the underlying factors contributing to COVID-19 deaths at the population level remain unclear. Prior studies indicate that demographic factors like proportion of population older than 65 and population health explain the cross-country difference in COVID-19 deaths. However, there has not been a comprehensive analysis including variables describing government policies and COVID-19 vaccination rate. Furthermore, prior studies focus on COVID-19 death rather than excess death to assess the impact of the pandemic. Through a robust statistical modeling framework, we analyze 80 countries and show that actionable public health efforts beyond just the factors intrinsic to each country are important for explaining the cross-country heterogeneity in excess death.
View details for DOI 10.1038/s41598-023-43407-0
View details for PubMedID 37758827
View details for PubMedCentralID PMC10533501
-
Confidence intervals for the Cox model test error from cross-validation.
Statistics in medicine
2023
Abstract
Cross-validation (CV) is one of the most widely used techniques in statistical learning for estimating the test error of a model, but its behavior is not yet fully understood. It has been shown that standard confidence intervals for test error using estimates from CV may have coverage below nominal levels. This phenomenon occurs because each sample is used in both the training and testing procedures during CV and as a result, the CV estimates of the errors become correlated. Without accounting for this correlation, the estimate of the variance is smaller than it should be. One way to mitigate this issue is by estimating the mean squared error of the prediction error instead using nested CV. This approach has been shown to achieve superior coverage compared to intervals derived from standard CV. In this work, we generalize the nested CV idea to the Cox proportional hazards model and explore various choices of test error for this setting.
View details for DOI 10.1002/sim.9873
View details for PubMedID 37580906
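For reference, the "standard" interval that the abstract above describes as under-covering can be sketched as follows. This is a minimal illustration, not the paper's nested CV procedure: it assumes `errors` holds one held-out loss per observation, pooled across folds, and it treats those losses as independent, which is exactly the assumption the abstract says fails. The function name `naive_cv_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def naive_cv_interval(errors, level=0.90):
    """Naive normal-theory interval for test error from pooled CV losses.

    errors: one held-out loss per observation, pooled over all folds.
    Treats the losses as independent; because every observation is used
    for both training and testing, they are in fact correlated, so this
    standard error tends to be too small.
    """
    n = len(errors)
    err_hat = mean(errors)            # CV point estimate of test error
    se = stdev(errors) / sqrt(n)      # naive standard error
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return err_hat - z * se, err_hat + z * se


if __name__ == "__main__":
    lo, hi = naive_cv_interval([1.0, 2.0, 3.0, 4.0, 5.0], level=0.95)
    print(f"naive 95% interval: ({lo:.3f}, {hi:.3f})")
```

Nested CV, as described in the abstract, replaces the naive standard error with an estimate of the mean squared error of the CV estimate; that step is not shown here.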
-
Spatial proteomics reveals human microglial states shaped by anatomy and neuropathology.
Research square
2023
Abstract
Microglia are implicated in aging, neurodegeneration, and Alzheimer's disease (AD). Traditional, low-plex, imaging methods fall short of capturing in situ cellular states and interactions in the human brain. We utilized Multiplexed Ion Beam Imaging (MIBI) and data-driven analysis to spatially map proteomic cellular states and niches in healthy human brain, identifying a spectrum of microglial profiles, called the microglial state continuum (MSC). The MSC ranged from senescent-like to active proteomic states that were skewed across large brain regions and compartmentalized locally according to their immediate microenvironment. While more active microglial states were proximal to amyloid plaques, globally, microglia significantly shifted towards a, presumably, dysfunctional low MSC in the AD hippocampus, as confirmed in an independent cohort (n=26). This provides an in situ single cell framework for mapping human microglial states along a continuous, shifting existence that is differentially enriched between healthy brain regions and disease, reinforcing differential microglial functions overall.
View details for DOI 10.21203/rs.3.rs-2987263/v1
View details for PubMedID 37398389
View details for PubMedCentralID PMC10312937
-
Distinguishing Renal Cell Carcinoma From Normal Kidney Tissue Using Mass Spectrometry Imaging Combined With Machine Learning.
JCO precision oncology
2023; 7: e2200668
Abstract
Accurately distinguishing renal cell carcinoma (RCC) from normal kidney tissue is critical for identifying positive surgical margins (PSMs) during partial and radical nephrectomy, which remains the primary intervention for localized RCC. Techniques that detect PSM with higher accuracy and faster turnaround time than intraoperative frozen section (IFS) analysis can help decrease reoperation rates, relieve patient anxiety and costs, and potentially improve patient outcomes. Here, we extended our combined desorption electrospray ionization mass spectrometry imaging (DESI-MSI) and machine learning methodology to identify metabolite and lipid species from tissue surfaces that can distinguish normal tissues from clear cell RCC (ccRCC), papillary RCC (pRCC), and chromophobe RCC (chRCC) tissues. From 24 normal and 40 renal cancer (23 ccRCC, 13 pRCC, and 4 chRCC) tissues, we developed a multinomial lasso classifier that selects 281 total analytes from over 27,000 detected molecular species that distinguishes all histological subtypes of RCC from normal kidney tissues with 84.5% accuracy. On the basis of independent test data reflecting distinct patient populations, the classifier achieves 85.4% and 91.2% accuracy on a Stanford test set (20 normal and 28 RCC) and a Baylor-UT Austin test set (16 normal and 41 RCC), respectively. The majority of the model's selected features show consistent trends across data sets affirming its stable performance, where the suppression of arachidonic acid metabolism is identified as a shared molecular feature of ccRCC and pRCC. Together, these results indicate that signatures derived from DESI-MSI combined with machine learning may be used to rapidly determine surgical margin status with accuracies that meet or exceed those reported for IFS.
View details for DOI 10.1200/PO.22.00668
View details for PubMedID 37285559
-
Cross-Validation: What Does It Estimate and How Well Does It Do It?
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2023
View details for DOI 10.1080/01621459.2023.2197686
View details for Web of Science ID 000989697400001
-
STABL Enables Reliable and Selective biomarker Discovery in Predictive Modeling of High Dimensional Omics Data
LIPPINCOTT WILLIAMS & WILKINS. 2023: 814-821
View details for Web of Science ID 001058985600289
-
Leading edge competition promotes context-dependent responses to receptor inputs to resolve directional dilemmas in neutrophil migration.
Cell systems
2023
Abstract
Maintaining persistent migration in complex environments is critical for neutrophils to reach infection sites. Neutrophils avoid getting trapped, even when obstacles split their front into multiple leading edges. How they re-establish polarity to move productively while incorporating receptor inputs under such conditions remains unclear. Here, we challenge chemotaxing HL60 neutrophil-like cells with symmetric bifurcating microfluidic channels to probe cell-intrinsic processes during the resolution of competing fronts. Using supervised statistical learning, we demonstrate that cells commit to one leading edge late in the process, rather than amplifying structural asymmetries or early fluctuations. Using optogenetic tools, we show that receptor inputs only bias the decision similarly late, once mechanical stretching begins to weaken each front. Finally, a retracting edge commits to retraction, with ROCK limiting sensitivity to receptor inputs until the retraction completes. Collectively, our results suggest that cell edges locally adopt highly stable protrusion/retraction programs that are modulated by mechanical feedback.
View details for DOI 10.1016/j.cels.2023.02.001
View details for PubMedID 36827986
-
A tissue atlas of ulcerative colitis revealing evidence of sex-dependent differences in disease-driving inflammatory cell types and resistance to TNF inhibitor therapy.
Science advances
2023; 9 (3): eadd1166
Abstract
Although literature suggests that resistance to TNF inhibitor (TNFi) therapy in patients with ulcerative colitis (UC) is partially linked to immune cell populations in the inflamed region, there is still substantial uncertainty underlying the relevant spatial context. Here, we used the highly multiplexed immunofluorescence imaging technology CODEX to create a publicly browsable tissue atlas of inflammation in 42 tissue regions from 29 patients with UC and 5 healthy individuals. We analyzed 52 biomarkers on 1,710,973 spatially resolved single cells to determine cell types, cell-cell contacts, and cellular neighborhoods. We observed that cellular functional states are associated with cellular neighborhoods. We further observed that a subset of inflammatory cell types and cellular neighborhoods are present in patients with UC with TNFi treatment, potentially indicating resistant niches. Last, we explored applying convolutional neural networks (CNNs) to our dataset with respect to patient clinical variables. We note concerns and offer guidelines for reporting CNN-based predictions in similar datasets.
View details for DOI 10.1126/sciadv.add1166
View details for Web of Science ID 000964550100033
View details for PubMedID 36662860
-
Feature-weighted elastic net: using "features of features" for better prediction.
Statistica Sinica
2023; 33 (1): 259-279
Abstract
In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic net penalty. In our simulations, fwelnet outperforms the lasso in terms of test mean squared error and usually gives an improvement in true positive rate or false positive rate for feature selection. We also apply this method to early prediction of preeclampsia, where fwelnet outperforms the lasso in terms of 10-fold cross-validated area under the curve (0.86 vs. 0.80). We also provide a connection between fwelnet and the group lasso and suggest how fwelnet might be used for multi-task learning.
View details for DOI 10.5705/ss.202020.0226
View details for PubMedID 37102071
-
Improved Relapse Prediction in Pediatric Acute Myeloid Leukemia By Deconvolving Lineage-Specific and Cancer-Specific Features in Single-Cell Data
AMER SOC HEMATOLOGY. 2022: 6288-6289
View details for DOI 10.1182/blood-2022-170939
View details for Web of Science ID 000893223206132
-
CD8+ T cell differentiation status correlates with the feasibility of sustained unresponsiveness following oral immunotherapy.
Nature communications
2022; 13 (1): 6646
Abstract
While food allergy oral immunotherapy (OIT) can provide safe and effective desensitization (DS), the immune mechanisms underlying development of sustained unresponsiveness (SU) following a period of avoidance are largely unknown. Here, we compare high dimensional phenotypes of innate and adaptive immune cell subsets of participants in a previously reported, phase 2 randomized, controlled, peanut OIT trial who achieved SU vs. DS (no vs. with allergic reactions upon food challenge after a withdrawal period; n=21 vs. 30 respectively among total 120 intent-to-treat participants). Lower frequencies of naive CD8+ T cells and terminally differentiated CD57+CD8+ T cell subsets at baseline (pre-OIT) are associated with SU. Frequency of naive CD8+ T cells shows a significant positive correlation with peanut-specific and Ara h 2-specific IgE levels at baseline. Higher frequencies of IL-4+ and IFNgamma+ CD4+ T cells post-OIT are negatively correlated with SU. Our findings provide evidence that an immune signature consisting of certain CD8+ T cell subset frequencies is potentially predictive of SU following OIT.
View details for DOI 10.1038/s41467-022-34222-8
View details for PubMedID 36333296
-
Cooperative learning for multiview analysis.
Proceedings of the National Academy of Sciences of the United States of America
2022; 119 (38): e2202113119
Abstract
We propose a method for supervised learning with multiple sets of features ("views"). The multiview problem is especially important in biology and medicine, where "-omics" data, such as genomics, proteomics, and radiomics, are measured on a common set of samples. "Cooperative learning" combines the usual squared-error loss of predictions with an "agreement" penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g., lasso, random forests, boosting, or neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty, yielding feature sparsity. The method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals. We show that cooperative learning achieves higher predictive accuracy on simulated data and real multiomics examples of labor-onset prediction. By leveraging aligned signals and allowing flexible fitting mechanisms for different modalities, cooperative learning offers a powerful approach to multiomics data fusion.
View details for DOI 10.1073/pnas.2202113119
View details for PubMedID 36095183
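The objective described in this abstract can be sketched for the simplest case: two views with one feature each and no lasso penalty. This is a minimal illustration, not the authors' software. The function names, the toy data generator, and the closed-form 2x2 solve are assumptions made for the example. With rho = 0 the fit is ordinary least squares on both features together.

```python
import random

def fit_cooperative(x, z, y, rho):
    """Minimize 0.5*sum((y - a*x - b*z)^2) + 0.5*rho*sum((a*x - b*z)^2) over (a, b)."""
    sxx = sum(u * u for u in x)
    szz = sum(v * v for v in z)
    sxz = sum(u * v for u, v in zip(x, z))
    sxy = sum(u * t for u, t in zip(x, y))
    szy = sum(v * t for v, t in zip(z, y))
    # Normal equations from setting both partial derivatives to zero:
    #   (1 + rho) * sxx * a + (1 - rho) * sxz * b = sxy
    #   (1 - rho) * sxz * a + (1 + rho) * szz * b = szy
    m11, m12, m22 = (1 + rho) * sxx, (1 - rho) * sxz, (1 + rho) * szz
    det = m11 * m22 - m12 * m12
    a = (sxy * m22 - m12 * szy) / det
    b = (m11 * szy - m12 * sxy) / det
    return a, b

def disagreement(x, z, a, b):
    """Agreement penalty without its weight: sum of squared view differences."""
    return sum((a * u - b * v) ** 2 for u, v in zip(x, z))

def make_views(n=200, seed=0):
    """Hypothetical toy data: both views are noisy copies of one shared signal."""
    rng = random.Random(seed)
    s = [rng.gauss(0, 1) for _ in range(n)]
    x = [v + rng.gauss(0, 0.5) for v in s]
    z = [v + rng.gauss(0, 0.5) for v in s]
    y = [2 * v + rng.gauss(0, 0.5) for v in s]
    return x, z, y

if __name__ == "__main__":
    x, z, y = make_views()
    for rho in (0.0, 0.5, 1.0, 5.0):
        a, b = fit_cooperative(x, z, y, rho)
        print(rho, round(a, 3), round(b, 3), round(disagreement(x, z, a, b), 3))
```

In practice rho would be chosen on a validation set or by cross-validation, as the abstract describes; the loop above only prints how the two view-specific fits move toward each other as rho grows.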
-
Post-infusion CAR T-Reg cells identify patients resistant to CD19-CAR therapy
NATURE MEDICINE
2022
Abstract
Approximately 60% of patients with large B cell lymphoma treated with chimeric antigen receptor (CAR) T cell therapies targeting CD19 experience disease progression, and neurotoxicity remains a challenge. Biomarkers associated with resistance and toxicity are limited. In this study, single-cell proteomic profiling of circulating CAR T cells in 32 patients treated with CD19-CAR identified that CD4+Helios+ CAR T cells on day 7 after infusion are associated with progressive disease and less severe neurotoxicity. Deep profiling demonstrated that this population is non-clonal and manifests hallmark features of T regulatory (TReg) cells. Validation cohort analysis upheld the link between higher CAR TReg cells with clinical progression and less severe neurotoxicity. A model combining expansion of this subset with lactate dehydrogenase levels, as a surrogate for tumor burden, was superior for predicting durable clinical response compared to models relying on each feature alone. These data credential CAR TReg cell expansion as a novel biomarker of response and toxicity after CAR T cell therapy and raise the prospect that this subset may regulate CAR T cell responses in humans.
View details for DOI 10.1038/s41591-022-01960-7
View details for Web of Science ID 000852940800007
View details for PubMedID 36097223
-
LARGE-SCALE MULTIVARIATE SPARSE REGRESSION WITH APPLICATIONS TO UK BIOBANK.
The annals of applied statistics
2022; 16 (3): 1891-1918
Abstract
In high-dimensional regression problems, often a relatively small subset of the features are relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that underlie the data. Our proposal is motivated by the UK Biobank population-based cohort study, where we are faced with large-scale, ultrahigh-dimensional features, and have access to a large number of outcomes (phenotypes): lifestyle measures, biomarkers, and disease outcomes. We are hence led to fit sparse reduced-rank regression models, using computational strategies that allow us to scale to problems of this size. We use a scheme that alternates between solving the sparse regression problem and solving the reduced rank decomposition. For the sparse regression component we propose a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables us to focus on solving much smaller subproblems. The full solution is reconstructed and tested via an optimality condition to make sure it is a valid solution for the original problem. We further extend the method to cope with practical issues, such as the inclusion of confounding variables and imputation of missing values among the phenotypes. Experiments on both synthetic data and the UK Biobank data demonstrate the effectiveness of the method and the algorithm. We present the multiSnpnet package, available at http://github.com/junyangq/multiSnpnet, which works on top of PLINK2 files and which we anticipate will be a valuable tool for generating polygenic risk scores from human genetic studies.
View details for DOI 10.1214/21-aoas1575
View details for PubMedID 36091495
View details for PubMedCentralID PMC9454085
-
LARGE-SCALE MULTIVARIATE SPARSE REGRESSION WITH APPLICATIONS TO UK BIOBANK
ANNALS OF APPLIED STATISTICS
2022; 16 (3): 1891-1918
View details for DOI 10.1214/21-AOAS1575
View details for Web of Science ID 000828472200030
-
Prediction and outlier detection in classification problems.
Journal of the Royal Statistical Society. Series B, Statistical methodology
2022; 84 (2): 524-546
Abstract
We consider the multi-class classification problem when the training data and the out-of-sample test data may have different distributions and propose a method called BCOPS (balanced and conformal optimized prediction sets). BCOPS constructs a prediction set C(x) as a subset of class labels, possibly empty. It tries to optimize the out-of-sample performance, aiming to include the correct class and to detect outliers x as often as possible. BCOPS returns no prediction (corresponding to C(x) equal to the empty set) if it infers x to be an outlier. The proposed method combines supervised learning algorithms with conformal prediction to minimize a misclassification loss averaged over the out-of-sample distribution. The constructed prediction sets have a finite sample coverage guarantee without distributional assumptions. We also propose a method to estimate the outlier detection rate of a given procedure. We prove asymptotic consistency and optimality of our proposals under suitable assumptions and illustrate our methods on real data examples.
View details for DOI 10.1111/rssb.12443
View details for PubMedID 35910400
View details for PubMedCentralID PMC9305480
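The idea of a prediction set that may be empty can be illustrated with a generic class-wise split-conformal sketch. This is not BCOPS itself: BCOPS learns its scores from both training and test data, whereas the conformity score, class centers, and calibration points below are hypothetical placeholders chosen for a one-dimensional toy problem.

```python
def conformity(x, center):
    """Higher score means x looks more typical of the class centred at `center`."""
    return -abs(x - center)

def conformal_p_value(cal_scores, score):
    """Share of calibration scores no larger than `score`, with the usual +1 correction."""
    rank = sum(1 for s in cal_scores if s <= score)
    return (1 + rank) / (len(cal_scores) + 1)

def prediction_set(x, centers, calibration, alpha=0.1):
    """Keep every label whose class-wise conformal p-value exceeds alpha."""
    labels = set()
    for label, center in centers.items():
        cal_scores = [conformity(c, center) for c in calibration[label]]
        if conformal_p_value(cal_scores, conformity(x, center)) > alpha:
            labels.add(label)
    return labels  # an empty set flags x as a likely outlier

# Hypothetical toy setup: two classes centred at 0 and 10, 19 calibration points each.
centers = {"a": 0.0, "b": 10.0}
offsets = [(i - 9) / 5 for i in range(19)]
calibration = {label: [c + d for d in offsets] for label, c in centers.items()}

if __name__ == "__main__":
    for x in (0.1, 10.1, 100.0):
        print(x, prediction_set(x, centers, calibration))
```

A point far from every class receives a small p-value for each label, so its set is empty and no prediction is returned, which mirrors the abstention behaviour described in the abstract.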
-
Significant sparse polygenic risk scores across 813 traits in UK Biobank.
PLoS genetics
2022; 18 (3): e1010105
Abstract
We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant (p < 2.5 × 10^-5) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotyping arrays, and the principal component loadings of genotypes. We report a significant correlation between the number of genetic variants selected in the sparse PRS model and the incremental predictive performance (Spearman's ρ = 0.61, p = 2.2 × 10^-59 for quantitative traits, ρ = 0.21, p = 9.6 × 10^-4 for binary traits). The sparse PRS model trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).
View details for DOI 10.1371/journal.pgen.1010105
View details for PubMedID 35324888
-
Prediction and outlier detection in classification problems
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2022
View details for DOI 10.1111/rssb.12443
View details for Web of Science ID 000755445400001
-
Identification of end-stage renal disease metabolic signatures from human perspiration
Natural Sciences
2022
View details for DOI 10.1002/ntls.20220048
-
Can auxiliary indicators improve COVID-19 forecasting and hotspot prediction?
Proceedings of the National Academy of Sciences of the United States of America
2021; 118 (51)
Abstract
Short-term forecasts of traditional streams from public health reporting (such as cases, hospitalizations, and deaths) are a key input to public health decision-making during a pandemic. Since early 2020, our research group has worked with data partners to collect, curate, and make publicly available numerous real-time COVID-19 indicators, providing multiple views of pandemic activity in the United States. This paper studies the utility of five such indicators (derived from deidentified medical insurance claims, self-reported symptoms from online surveys, and COVID-related Google search activity) from a forecasting perspective. For each indicator, we ask whether its inclusion in an autoregressive (AR) model leads to improved predictive accuracy relative to the same model excluding it. Such an AR model, without external features, is already competitive with many top COVID-19 forecasting models in use today. Our analysis reveals that 1) inclusion of each of these five indicators improves on the overall predictive accuracy of the AR model; 2) predictive gains are in general most pronounced during times in which COVID cases are trending in "flat" or "down" directions; and 3) one indicator, based on Google searches, seems to be particularly helpful during "up" trends.
View details for DOI 10.1073/pnas.2111453118
View details for PubMedID 34903655
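The comparison in this abstract, an AR model with and without an auxiliary indicator, can be sketched on a toy series. Everything here is a simplifying assumption: a single lag, no intercept, a synthetic indicator that leads the target by one step, and mean absolute error on a held-out tail. The models and evaluation in the paper are richer than this.

```python
def fit_one_coef(u, target):
    """Least squares for target ~ a*u (AR model with one lag, no intercept)."""
    return sum(p * t for p, t in zip(u, target)) / sum(p * p for p in u)

def fit_two_coef(u, v, target):
    """Least squares for target ~ a*u + b*v via the 2x2 normal equations."""
    suu = sum(p * p for p in u)
    svv = sum(q * q for q in v)
    suv = sum(p * q for p, q in zip(u, v))
    sut = sum(p * t for p, t in zip(u, target))
    svt = sum(q * t for q, t in zip(v, target))
    det = suu * svv - suv * suv
    return (sut * svv - suv * svt) / det, (suu * svt - suv * sut) / det

def mean_abs_error(pred, actual):
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)

def simulate(n):
    """Hypothetical toy series in which the indicator x leads the target y by one step."""
    x = [float((t % 7) - 3) for t in range(n)]
    y = [0.0]
    for t in range(1, n):
        y.append(0.5 * y[t - 1] + x[t - 1])
    return x, y

def compare(n=120, n_train=80):
    """Return held-out error of the AR-only model and of the AR-plus-indicator model."""
    x, y = simulate(n)
    y_lag, x_lag, target = y[:-1], x[:-1], y[1:]
    a_only = fit_one_coef(y_lag[:n_train], target[:n_train])
    a, b = fit_two_coef(y_lag[:n_train], x_lag[:n_train], target[:n_train])
    test = range(n_train, n - 1)
    actual = [target[t] for t in test]
    err_ar = mean_abs_error([a_only * y_lag[t] for t in test], actual)
    err_ax = mean_abs_error([a * y_lag[t] + b * x_lag[t] for t in test], actual)
    return err_ar, err_ax

if __name__ == "__main__":
    print(compare())
```

Because the toy target is built from the lagged indicator, the augmented model wins by construction; with real indicators the size and direction of any gain is an empirical question, which is what the paper investigates.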
-
Author Correction: Genetics of 35 blood and urine biomarkers in the UK Biobank.
Nature genetics
2021
View details for DOI 10.1038/s41588-021-00956-2
View details for PubMedID 34608296
-
Rapid Screening of COVID-19 Directly from Clinical Nasopharyngeal Swabs Using the MasSpec Pen.
Analytical chemistry
2021
Abstract
The outbreak of COVID-19 has created an unprecedented global crisis. While the polymerase chain reaction (PCR) is the gold standard method for detecting active SARS-CoV-2 infection, alternative high-throughput diagnostic tests are of significant value to meet universal testing demands. Here, we describe a new design of the MasSpec Pen technology integrated with electrospray ionization (ESI) for direct analysis of clinical swabs and investigate its use for COVID-19 screening. The redesigned MasSpec Pen system incorporates a disposable sampling device refined for uniform and efficient analysis of swab tips via liquid extraction directly coupled to an ESI source. Using this system, we analyzed nasopharyngeal swabs from 244 individuals including symptomatic COVID-19 positive, symptomatic negative, and asymptomatic negative individuals, enabling rapid detection of rich lipid profiles. Two statistical classifiers were generated based on the lipid information acquired. Classifier 1 was built to distinguish symptomatic PCR-positive from asymptomatic PCR-negative individuals, yielding a cross-validation accuracy of 83.5%, sensitivity of 76.6%, and specificity of 86.6%, and validation set accuracy of 89.6%, sensitivity of 100%, and specificity of 85.3%. Classifier 2 was built to distinguish symptomatic PCR-positive patients from negative individuals including symptomatic PCR-negative patients with moderate to severe symptoms and asymptomatic individuals, yielding a cross-validation accuracy of 78.4%, specificity of 77.21%, and sensitivity of 81.8%. Collectively, this study suggests that the lipid profiles detected directly from nasopharyngeal swabs using MasSpec Pen-ESI mass spectrometry (MS) allow fast (under a minute) screening of the COVID-19 disease using minimal operating steps and no specialized reagents, thus representing a promising alternative high-throughput method for screening of COVID-19.
View details for DOI 10.1021/acs.analchem.1c01937
View details for PubMedID 34432430
-
Testing for a Sweet Spot in Randomized Trials.
Medical decision making : an international journal of the Society for Medical Decision Making
2021: 272989X211025525
Abstract
INTRODUCTION: Randomized trials recruit diverse patients, including some individuals who may be unresponsive to the treatment. Here we follow up on prior conceptual advances and introduce a specific method that does not rely on stratification analysis and that tests whether patients in the intermediate range of disease severity experience more relative benefit than patients at the extremes of disease severity (sweet spot). METHODS: We contrast linear models to sigmoidal models when describing associations between disease severity and accumulating treatment benefit. The Gompertz curve is highlighted as a specific sigmoidal curve along with the Akaike information criterion (AIC) as a measure of goodness of fit. This approach is then applied to a matched analysis of a published landmark randomized trial evaluating whether implantable defibrillators reduce overall mortality in cardiac patients (n = 2,521). RESULTS: The linear model suggested a significant survival advantage across the spectrum of increasing disease severity (beta = 0.0847, P < 0.001, AIC = 2,491). Similarly, the sigmoidal model suggested a significant survival advantage across the spectrum of disease severity (alpha = 93, beta = 4.939, gamma = 0.00316, P < 0.001 for all, AIC = 1,660). The discrepancy between the 2 models indicated worse goodness of fit with a linear model compared to a sigmoidal model (AIC: 2,491 v. 1,660, P < 0.001), thereby suggesting a sweet spot in the midrange of disease severity. Model cross-validation using computational statistics also confirmed the superior goodness of fit of the sigmoidal curve with a concentration of survival benefits for patients in the midrange of disease severity. CONCLUSION: Systematic methods are available beyond simple stratification for identifying a sweet spot according to disease severity. The approach can assess whether some patients experience more relative benefit than other patients in a randomized trial.
View details for DOI 10.1177/0272989X211025525
View details for PubMedID 34378458
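The linear-versus-sigmoidal comparison by AIC can be sketched as follows. The three-parameter Gompertz form and the Gaussian least-squares version of AIC used here are standard textbook choices and are assumed rather than taken from the paper. The parameter values are those quoted in the abstract, reused purely for illustration on a hypothetical severity scale with synthetic data, and the Gompertz curve is evaluated at known parameters rather than fitted.

```python
import math

def gompertz(x, alpha, beta, gamma):
    """Three-parameter Gompertz curve: alpha * exp(-beta * exp(-gamma * x))."""
    return alpha * math.exp(-beta * math.exp(-gamma * x))

def gaussian_aic(residuals, n_params):
    """AIC for a least-squares fit with Gaussian errors, up to an additive constant."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * n_params

def linear_residuals(xs, ys):
    """Residuals of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def compare_fits(alpha=93.0, beta=4.939, gamma=0.00316):
    """Return (AIC of linear fit, AIC of Gompertz curve) on synthetic sigmoidal data."""
    xs = [50.0 * i for i in range(41)]  # hypothetical severity scale, 0 to 2000
    # Synthetic benefit values: Gompertz curve plus an alternating +/-1 perturbation.
    ys = [gompertz(x, alpha, beta, gamma) + (1.0 if i % 2 else -1.0)
          for i, x in enumerate(xs)]
    sigmoid_res = [y - gompertz(x, alpha, beta, gamma) for x, y in zip(xs, ys)]
    return gaussian_aic(linear_residuals(xs, ys), 2), gaussian_aic(sigmoid_res, 3)

if __name__ == "__main__":
    print(compare_fits())
```

A lower AIC for the sigmoidal curve than for the line is the pattern the abstract reads as evidence of a sweet spot; here that outcome is built into the synthetic data.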
-
Author Correction: An inflammatory aging clock (iAge) based on deep learning tracks multimorbidity, immunosenescence, frailty and cardiovascular aging.
Nature aging
2021; 1 (8): 748
View details for DOI 10.1038/s43587-021-00102-x
View details for PubMedID 37117770
-
Penalized regression for left-truncated and right-censored survival data.
Statistics in medicine
2021
Abstract
High-dimensional data are becoming increasingly common in the medical field as large volumes of patient information are collected and processed by high-throughput screening, electronic health records, and comprehensive genomic testing. Statistical models that attempt to study the effects of many predictors on survival typically implement feature selection or penalized methods to mitigate the undesirable consequences of overfitting. In some cases survival data are also left-truncated which can give rise to an immortal time bias, but penalized survival methods that adjust for left truncation are not commonly implemented. To address these challenges, we apply a penalized Cox proportional hazards model for left-truncated and right-censored survival data and assess implications of left truncation adjustment on bias and interpretation. We use simulation studies and a high-dimensional, real-world clinico-genomic database to highlight the pitfalls of failing to account for left truncation in survival modeling.
View details for DOI 10.1002/sim.9136
View details for PubMedID 34302373
-
The stanford prostate cancer calculator: Development and external validation of online nomograms incorporating PIRADS scores to predict clinically significant prostate cancer.
Urologic oncology
2021
Abstract
BACKGROUND: While multiparametric MRI (mpMRI) has high sensitivity for detection of clinically significant prostate cancer (CSC), false positives and negatives remain common. Calculators that combine mpMRI with clinical variables can improve cancer risk assessment, while providing more accurate predictions for individual patients. We sought to create and externally validate nomograms incorporating Prostate Imaging Reporting and Data System (PIRADS) scores and clinical data to predict the presence of CSC in men of all biopsy backgrounds. METHODS: Data from 2125 men undergoing mpMRI and MR fusion biopsy from 2014 to 2018 at Stanford, Yale, and UAB were prospectively collected. Clinical data included age, race, PSA, biopsy status, PIRADS scores, and prostate volume. A nomogram predicting detection of CSC on targeted or systematic biopsy was created. RESULTS: Biopsy history, Prostate Specific Antigen (PSA) density, PIRADS score of 4 or 5, Caucasian race, and age were significant independent predictors. Our nomogram, the Stanford Prostate Cancer Calculator (SPCC), combined these factors in a logistic regression to provide stronger predictive accuracy than PSA density or PIRADS alone. Validation of the SPCC using data from Yale and UAB yielded robust AUC values. CONCLUSIONS: The SPCC combines pre-biopsy mpMRI with clinical data to more accurately predict the probability of CSC in men of all biopsy backgrounds. The SPCC demonstrates strong external generalizability with successful validation in two separate institutions. The calculator is available as a free web-based tool that can direct real-time clinical decision-making.
View details for DOI 10.1016/j.urolonc.2021.06.004
View details for PubMedID 34247909
-
Corrigendum to: Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank.
Biostatistics (Oxford, England)
2021
View details for DOI 10.1093/biostatistics/kxab019
View details for PubMedID 34269393
-
An inflammatory aging clock (iAge) based on deep learning tracks multimorbidity, immunosenescence, frailty and cardiovascular aging.
Nature aging
2021; 1: 598-615
Abstract
While many diseases of aging have been linked to the immunological system, immune metrics capable of identifying the most at-risk individuals are lacking. From the blood immunome of 1,001 individuals aged 8-96 years, we developed a deep-learning method based on patterns of systemic age-related inflammation. The resulting inflammatory clock of aging (iAge) tracked with multimorbidity, immunosenescence, frailty and cardiovascular aging, and is also associated with exceptional longevity in centenarians. The strongest contributor to iAge was the chemokine CXCL9, which was involved in cardiac aging, adverse cardiac remodeling and poor vascular function. Furthermore, aging endothelial cells in human and mice show loss of function, cellular senescence and hallmark phenotypes of arterial stiffness, all of which are reversed by silencing CXCL9. In conclusion, we identify a key role of CXCL9 in age-related chronic inflammation and derive a metric for multimorbidity that can be utilized for the early detection of age-related clinical phenotypes.
View details for DOI 10.1038/s43587-021-00082-y
View details for PubMedID 34888528
-
Fast Numerical Optimization for Genome Sequencing Data in Population Biobanks.
Bioinformatics (Oxford, England)
2021
Abstract
MOTIVATION: Large-scale and high-dimensional genome sequencing data pose computational challenges. General purpose optimization tools are usually not optimal in terms of computational and memory performance for genetic data. RESULTS: We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. These genetic variants are encoded by the values in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry in a genetic matrix, which reduces the memory requirement by a factor of 32 compared to a double precision floating point representation. Using this representation, we implemented an iteratively reweighted least squares algorithm to solve Lasso regressions on genetic matrices, which we name snpnet-2.0. When the dataset contains many rare variants, the predictors can be encoded in a sparse matrix. We utilize the sparsity in the predictor matrix to further reduce the memory requirement and computation time. Our sparse genetic matrix implementation uses both the compact 2-bit representation and a simplified version of compressed sparse block format so that matrix-vector multiplications can be effectively parallelized on multiple CPU cores. To demonstrate the effectiveness of this representation, we implement an accelerated proximal gradient method to solve group Lasso on these sparse genetic matrices. This solver is named sparse-snpnet, and will also be included as part of the snpnet R package. Our implementation is able to solve Lasso and group Lasso, linear, logistic and Cox regression problems on sparse genetic matrices that contain 1,000,000 variants and almost 100,000 individuals within 10 minutes and using less than 32 GB of memory. AVAILABILITY: https://github.com/rivas-lab/snpnet/tree/compact.
View details for DOI 10.1093/bioinformatics/btab452
View details for PubMedID 34146108
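The 2-bit representation mentioned in this abstract can be illustrated with a small packing routine. This is only a sketch of the general idea. The bit layout, the choice of code 3 for NA, and the function names are assumptions made for the example and are not taken from snpnet or from the PLINK file format.

```python
NA = 3  # 2-bit code reserved here for a missing genotype (an assumed convention)

def pack_genotypes(calls):
    """Pack genotype calls (0, 1, 2, or None for NA) into 2 bits each, 4 per byte."""
    packed = bytearray((len(calls) + 3) // 4)
    for i, call in enumerate(calls):
        code = NA if call is None else call
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)

def unpack_genotypes(packed, n):
    """Recover the first n genotype calls from a packed byte string."""
    calls = []
    for i in range(n):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        calls.append(None if code == NA else code)
    return calls

if __name__ == "__main__":
    calls = [0, 1, 2, None, 2, 2, 0, 1, None]
    packed = pack_genotypes(calls)
    print(len(packed), unpack_genotypes(packed, len(calls)))
```

Each entry takes 2 bits instead of the 64 bits of a double, which is where the factor of 32 quoted in the abstract comes from.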
-
Assessment of heterogeneous treatment effect estimation accuracy via matching.
Statistics in medicine
2021
Abstract
We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo-observations of the HTE based on matching. Our contributions are three-fold: first, we introduce a novel matching distance derived from proximity scores in random forests; second, we formulate the matching problem as an average minimum-cost flow problem and provide an efficient algorithm; third, we propose a match-then-split principle for the assessment with cross-validation. We demonstrate the efficacy of the assessment approach using simulations and a real dataset.
View details for DOI 10.1002/sim.9010
View details for PubMedID 33915600
-
Principal component-guided sparse regression
CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE
2021
View details for DOI 10.1002/cjs.11617
View details for Web of Science ID 000640651700001
-
LassoNet: Neural Networks with Feature Sparsity.
Proceedings of machine learning research
2021; 130: 10-18
Abstract
Much work has been done recently to make neural networks more interpretable, and one approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or ℓ1-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However, the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach achieves feature sparsity by allowing a feature to participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection with the parameter learning directly. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. In experiments with real and simulated data, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent, and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.
View details for PubMedID 36092461
View details for PubMedCentralID PMC9453696
-
Survival Analysis on Rare Events Using Group-Regularized Multi-Response Cox Regression.
Bioinformatics (Oxford, England)
2021
Abstract
MOTIVATION: The prediction performance of the Cox proportional hazards model suffers when there are only a few uncensored events in the training data. RESULTS: We propose a Sparse-Group regularized Cox regression method to improve the prediction performance of large-scale and high-dimensional survival data with few observed events. Our approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare event response. This scenario is common in the UK Biobank (Sudlow et al., 2015) dataset where records for a large number of common and less prevalent diseases of the same set of individuals are available. By analyzing these responses together, we hope to achieve higher prediction performance than when they are analyzed individually. To make this approach practical for large-scale data, we developed an accelerated proximal gradient optimization algorithm as well as a screening procedure inspired by Qian et al. (2020). AVAILABILITY: https://github.com/rivas-lab/multisnpnet-Cox. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btab095
View details for PubMedID 33560296
-
Basophil activation tests identify a peanut OIT subgroup with improved safety and outcomes
MOSBY-ELSEVIER. 2021: AB166
View details for Web of Science ID 000629158000529
-
Genetics of 35 blood and urine biomarkers in the UK Biobank.
Nature genetics
2021
Abstract
Clinical laboratory tests are a critical component of the continuum of care. We evaluate the genetic basis of 35 blood and urine laboratory measurements in the UK Biobank (n = 363,228 individuals). We identify 1,857 loci associated with at least one trait, containing 3,374 fine-mapped associations and additional sets of large-effect (>0.1 s.d.) protein-altering, human leukocyte antigen (HLA) and copy number variant (CNV) associations. Through Mendelian randomization (MR) analysis, we discover 51 causal relationships, including previously known agonistic effects of urate on gout and cystatin C on stroke. Finally, we develop polygenic risk scores (PRSs) for each biomarker and build 'multi-PRS' models for diseases using 35 PRSs simultaneously, which improved chronic kidney disease, type 2 diabetes, gout and alcoholic cirrhosis genetic risk stratification in an independent dataset (FinnGen; n = 135,500) relative to single-disease PRSs. Together, our results delineate the genetic basis of biomarkers and their causal influences on diseases and improve genetic risk stratification for common diseases.
View details for DOI 10.1038/s41588-020-00757-z
View details for PubMedID 33462484
-
An open repository of real-time COVID-19 indicators.
Proceedings of the National Academy of Sciences of the United States of America
2021; 118 (51)
Abstract
The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: Operational since April 2020, it provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from deidentified medical claims data, massive online surveys, cell phone mobility data, and internet search trends. These are available at a fine geographic resolution (mostly at the county level) and are updated daily. The COVIDcast API also tracks all revisions to historical data, allowing modelers to account for the frequent revisions and backfill that are common for many public health data sources. All of the data are available in a common format through the API and accompanying R and Python software packages. This paper describes the data sources and signals, and provides examples demonstrating that the auxiliary signals in the COVIDcast API present information relevant to tracking COVID activity, augmenting traditional public health reporting and empowering research and decision-making.
View details for DOI 10.1073/pnas.2111452118
View details for PubMedID 34903654
-
MassExplorer: a computational tool for analyzing desorption electrospray ionization mass spectrometry data
Bioinformatics (Oxford, England)
2021
Abstract
High-throughput gene expression can be used to address a wide range of fundamental biological problems, but datasets of an appropriate size are often unavailable. Moreover, existing transcriptomics simulators have been criticised because they fail to emulate key properties of gene expression data. In this paper, we develop a method based on a conditional generative adversarial network to generate realistic transcriptomics data for E. coli and humans. We assess the performance of our approach across several tissues and cancer types. We show that our model preserves several gene expression properties significantly better than widely used simulators such as SynTReN or GeneNetWeaver. The synthetic data preserves tissue and cancer-specific properties of transcriptomics data. Moreover, it exhibits real gene clusters and ontologies both at local and global scales, suggesting that the model learns to approximate the gene expression manifold in a biologically meaningful way. Code is available at: https://github.com/rvinas/adversarial-gene-expression. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btab282
View details for PubMedID 34009252
-
LassoNet: Neural Networks with Feature Sparsity
MICROTOME PUBLISHING. 2021: 10-+
View details for Web of Science ID 000659893800002
-
De novo mutational signature discovery in tumor genomes using SparseSignatures.
PLoS computational biology
2021; 17 (6): e1009119
Abstract
Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or "mutational signatures". Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using a variety of standard metrics. We then apply SparseSignatures to whole genome sequences of pancreatic and breast tumors, discovering well-differentiated signatures that are linked to known mutagenic mechanisms and are strongly associated with patient clinical features.
View details for DOI 10.1371/journal.pcbi.1009119
View details for PubMedID 34181655
-
LassoNet: A Neural Network with Feature Sparsity
JOURNAL OF MACHINE LEARNING RESEARCH
2021; 22
View details for Web of Science ID 000663180800001
-
Discussion of "Prediction, Estimation, and Attribution" by Bradley Efron
INTERNATIONAL STATISTICAL REVIEW
2020; 88: S73–S74
View details for DOI 10.1111/insr.12414
View details for Web of Science ID 000603161400008
-
Reluctant Generalised Additive Modelling.
International statistical review = Revue internationale de statistique
2020; 88 (Suppl 1): S205-S224
Abstract
Sparse generalised additive models (GAMs) are an extension of sparse generalised linear models that allow a model's prediction to vary non-linearly with an input variable. This enables the data analyst to build more accurate models, especially when the linearity assumption is known to be a poor approximation of reality. Motivated by reluctant interaction modelling, we propose a multi-stage algorithm, called reluctant generalised additive modelling (RGAM), that can fit sparse GAMs at scale. It is guided by the principle that, if all else is equal, one should prefer a linear feature over a non-linear feature. Unlike existing methods for sparse GAMs, RGAM can be extended easily to binary, count and survival data. We demonstrate the method's effectiveness on real and simulated examples.
View details for DOI 10.1111/insr.12429
View details for PubMedID 36062079
View details for PubMedCentralID PMC9435322
-
Reluctant Generalised Additive Modelling
INTERNATIONAL STATISTICAL REVIEW
2020
View details for DOI 10.1111/insr.12429
View details for Web of Science ID 000591285600001
-
Metabolic Dynamics and Prediction of Gestational Age and Time to Delivery in Pregnant Women
OBSTETRICAL & GYNECOLOGICAL SURVEY
2020; 75 (11): 649–51
View details for DOI 10.1097/OGX.0000000000000864
View details for Web of Science ID 000594473400001
-
Rejoinder: Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons
STATISTICAL SCIENCE
2020; 35 (4): 625–26
View details for DOI 10.1214/20-STS733REJ
View details for Web of Science ID 000591728200007
-
Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons
STATISTICAL SCIENCE
2020; 35 (4): 579–92
View details for DOI 10.1214/19-STS733
View details for Web of Science ID 000591728200002
-
Integration of mechanistic immunological knowledge into a machine learning pipeline improves predictions
NATURE MACHINE INTELLIGENCE
2020
View details for DOI 10.1038/s42256-020-00232-8
View details for Web of Science ID 000579336000001
-
Integration of mechanistic immunological knowledge into a machine learning pipeline improves predictions.
Nature machine intelligence
2020; 2 (10): 619-628
Abstract
The dense network of interconnected cellular signalling responses that are quantifiable in peripheral immune cells provides a wealth of actionable immunological insights. Although high-throughput single-cell profiling techniques, including polychromatic flow and mass cytometry, have matured to a point that enables detailed immune profiling of patients in numerous clinical settings, the limited cohort size and high dimensionality of data increase the possibility of false-positive discoveries and model overfitting. We introduce a generalizable machine learning platform, the immunological Elastic-Net (iEN), which incorporates immunological knowledge directly into the predictive models. Importantly, the algorithm maintains the exploratory nature of the high-dimensional dataset, allowing for the inclusion of immune features with strong predictive capabilities even if not consistent with prior knowledge. In three independent studies our method demonstrates improved predictions for clinically relevant outcomes from mass cytometry data generated from whole blood, as well as a large simulated dataset. The iEN is available under an open-source licence.
View details for DOI 10.1038/s42256-020-00232-8
View details for PubMedID 33294774
View details for PubMedCentralID PMC7720904
-
Transparency and reproducibility in artificial intelligence.
Nature
2020; 586 (7829): E14–E16
View details for DOI 10.1038/s41586-020-2766-y
View details for PubMedID 33057217
-
A Pliable Lasso.
Journal of computational and graphical statistics : a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America
2020; 29 (1): 215-225
Abstract
We propose a generalization of the lasso that allows the model coefficients to vary as a function of a general set of some prespecified modifying variables. These modifiers might be variables such as gender, age, or time. The paradigm is quite general, with each lasso coefficient modified by a sparse linear function of the modifying variables Z. The model is estimated in a hierarchical fashion to control the degrees of freedom and avoid overfitting. The modifying variables may be observed, observed only in the training set, or unobserved overall. There are connections of our proposal to varying coefficient models and high-dimensional interaction models. We present a computationally efficient algorithm for its optimization, with exact screening rules to facilitate application to large numbers of predictors. The method is illustrated on a number of different simulated and real examples. Supplementary materials for this article are available online.
View details for DOI 10.1080/10618600.2019.1648271
View details for PubMedID 36340327
View details for PubMedCentralID PMC9631466
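As a reading aid, the model described in this abstract can be written out as follows. The notation is an assumption for illustration: $X_j$ is the $j$-th predictor column, $Z$ the matrix of modifying variables, $\beta_j$ the main-effect coefficient, $\theta_j$ the vector that modifies it, and $\circ$ elementwise multiplication. The hierarchy condition on the right is one way to express "estimated in a hierarchical fashion".

```latex
\hat{y} \;=\; \beta_0 \mathbf{1} + Z\theta_0
        + \sum_{j=1}^{p} X_j \circ \left( \beta_j \mathbf{1} + Z\theta_j \right),
\qquad \theta_j \neq 0 \ \text{only if}\ \beta_j \neq 0 .
```

Under this reading, setting every $\theta_j = 0$ recovers the ordinary lasso fit, and each nonzero $\theta_j$ lets the effect of $X_j$ vary linearly with the modifiers in $Z$.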
-
Post model-fitting exploration via a "Next-Door" analysis.
The Canadian journal of statistics = Revue canadienne de statistique
2020; 48 (3): 447-470
Abstract
We propose a simple method for evaluating the model that has been chosen by an adaptive regression procedure, our main focus being the lasso. This procedure deletes each chosen predictor and refits the lasso to get a set of models that are "close" to the chosen "base model," and compares the error rate of the base model with those of nearby models. If the deletion of a predictor leads to significant deterioration in the model's predictive power, the predictor is called indispensable; otherwise, the nearby model is called acceptable and can serve as a good alternative to the base model. This provides both an assessment of the predictive contribution of each variable and a set of alternative models that may be used in place of the chosen model. We call this procedure "Next-Door analysis" since it examines models "next" to the base model. It can be applied to supervised learning problems with ℓ1 penalization and stepwise procedures. We have implemented it in the R language as a library to accompany the well-known glmnet library.
View details for DOI 10.1002/cjs.11542
View details for PubMedID 36092475
View details for PubMedCentralID PMC9454156
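The delete-and-refit loop described in this abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' R library: the function names, the single held-out validation split, the fixed penalty `lam`, and the 10% deterioration threshold `tol` are all assumptions made here, whereas the paper assesses whether the deterioration is significant. Predictors are assumed to be roughly centred, so no intercept is fitted.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, active, n_sweeps=200):
    """Minimise (1/2n)||y - Xb||^2 + lam*||b||_1 over the columns in `active`."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    for _ in range(n_sweeps):
        for j in active:
            col = [row[j] for row in X]
            sq = sum(v * v for v in col) / n
            if sq == 0.0:
                continue
            # Correlation of column j with the partial residual (column j added back).
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / sq
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
    return beta

def mse(X, y, beta):
    """Mean squared prediction error of the linear fit `beta`."""
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y)) / len(y)

def next_door(X_tr, y_tr, X_va, y_va, lam, tol=0.10):
    """Delete each predictor chosen by the base lasso, refit, and compare errors."""
    p = len(X_tr[0])
    base = lasso_cd(X_tr, y_tr, lam, range(p))
    base_err = mse(X_va, y_va, base)
    report = {}
    for j in [k for k in range(p) if base[k] != 0.0]:
        others = [k for k in range(p) if k != j]
        neighbour = lasso_cd(X_tr, y_tr, lam, others)
        err = mse(X_va, y_va, neighbour)
        # Hypothetical rule: "indispensable" if the error grows by more than tol.
        report[j] = {"error": err, "indispensable": err > (1 + tol) * base_err}
    return base_err, report

def make_demo_data(n=120, p=4, seed=0):
    """Simulated data in which only the first predictor carries signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] + 0.1 * rng.gauss(0, 1) for row in X]
    return X, y
```

Calling `next_door(X[:80], y[:80], X[80:], y[80:], lam=0.1)` on the simulated data returns the base model's validation error and, for each chosen predictor, the error of its "next-door" model. Predictors whose removal barely changes the error correspond to acceptable alternative models in the abstract's terminology.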
-
SARS-CoV-2 Antibody Responses Correlate with Resolution of RNAemia But Are Short-Lived in Patients with Mild Illness.
medRxiv : the preprint server for health sciences
2020
Abstract
SARS-CoV-2-specific antibodies, particularly those preventing viral spike receptor binding domain (RBD) interaction with host angiotensin-converting enzyme 2 (ACE2) receptor, could offer protective immunity, and may affect clinical outcomes of COVID-19 patients. We analyzed 625 serial plasma samples from 40 hospitalized COVID-19 patients and 170 SARS-CoV-2-infected outpatients and asymptomatic individuals. Severely ill patients developed significantly higher SARS-CoV-2-specific antibody responses than outpatients and asymptomatic individuals. The development of plasma antibodies was correlated with decreases in viral RNAemia, consistent with potential humoral immune clearance of virus. Using a novel competition ELISA, we detected antibodies blocking RBD-ACE2 interactions in 68% of inpatients and 40% of outpatients tested. Cross-reactive antibodies recognizing SARS-CoV RBD were found almost exclusively in hospitalized patients. Outpatient and asymptomatic individuals' serological responses to SARS-CoV-2 decreased within 2 months, suggesting that humoral protection may be short-lived.
View details for DOI 10.1101/2020.08.15.20175794
View details for PubMedID 32839786
View details for PubMedCentralID PMC7444305
-
Transcriptional changes in peanut-specific CD4+ T cells over the course of oral immunotherapy.
Clinical immunology (Orlando, Fla.)
2020: 108568
Abstract
Oral immunotherapy (OIT) can successfully desensitize allergic individuals to offending foods such as peanut. Our recent clinical trial (NCT02103270) of peanut OIT allowed us to monitor peanut-specific CD4+ T cells, using MHC-peptide Dextramers, over the course of OIT. We used a single-cell targeted RNAseq assay to analyze these cells at 0, 12, 24, 52, and 104 weeks of OIT. We found a transient increase in TGFbeta-producing cells at 52 weeks in those with successful desensitization, which lasted until 117 weeks. We also performed clustering and identified 5 major clusters of Dextramer+ cells, which we tracked over time. One of these clusters appeared to be anergic, while another was consistent with recently described TFH13 cells. The other 3 clusters appeared to be Th2 cells by their coordinated production of IL-4 and IL-13, but they varied in their expression of STAT signaling proteins and other markers. A cluster with high expression of STAT family members also showed a possible transient increase at week 24 in those with successful desensitization. Single cell TCRalphabeta repertoire sequences were too diverse to track clones over time. Together with increased TGFbeta production, these changes may be mechanistic predictors of successful OIT that should be further investigated.
View details for DOI 10.1016/j.clim.2020.108568
View details for PubMedID 32783912
-
Molecular Transducers of Physical Activity Consortium (MoTrPAC): Mapping the Dynamic Responses to Exercise.
Cell
2020; 181 (7): 1464–74
Abstract
Exercise provides a robust physiological stimulus that evokes cross-talk among multiple tissues that when repeated regularly (i.e., training) improves physiological capacity, benefits numerous organ systems, and decreases the risk for premature mortality. However, a gap remains in identifying the detailed molecular signals induced by exercise that benefits health and prevents disease. The Molecular Transducers of Physical Activity Consortium (MoTrPAC) was established to address this gap and generate a molecular map of exercise. Preclinical and clinical studies will examine the systemic effects of endurance and resistance exercise across a range of ages and fitness levels by molecular probing of multiple tissues before and after acute and chronic exercise. From this multi-omic and bioinformatic analysis, a molecular map of exercise will be established. Altogether, MoTrPAC will provide a public database that is expected to enhance our understanding of the health benefits of exercise and to provide insight into how physical activity mitigates disease.
View details for DOI 10.1016/j.cell.2020.06.004
View details for PubMedID 32589957
-
Metabolic Dynamics and Prediction of Gestational Age and Time to Delivery in Pregnant Women.
Cell
2020; 181 (7): 1680
Abstract
Metabolism during pregnancy is a dynamic and precisely programmed process, the failure of which can bring devastating consequences to the mother and fetus. To define a high-resolution temporal profile of metabolites during healthy pregnancy, we analyzed the untargeted metabolome of 784 weekly blood samples from 30 pregnant women. Broad changes and a highly choreographed profile were revealed: 4,995 metabolic features (of 9,651 total), 460 annotated compounds (of 687 total), and 34 human metabolic pathways (of 48 total) were significantly changed during pregnancy. Using linear models, we built a metabolic clock with five metabolites that time gestational age in high accordance with ultrasound (R = 0.92). Furthermore, two to three metabolites can identify when labor occurs (time to delivery within two, four, and eight weeks, AUROC ≥ 0.85). Our study represents a weekly characterization of the human pregnancy metabolome, providing a high-resolution landscape for understanding pregnancy with potential clinical utilities.
View details for DOI 10.1016/j.cell.2020.05.002
View details for PubMedID 32589958
-
Discussion of "Prediction, Estimation, and Attribution" by Bradley Efron
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2020; 115 (530): 665–66
View details for DOI 10.1080/01621459.2020.1762617
View details for Web of Science ID 000538423300016
-
Integrating genomic features for non-invasive early lung cancer detection.
Nature
2020; 580 (7802): 245-251
Abstract
Radiologic screening of high-risk adults reduces lung-cancer-related mortality1,2; however, a small minority of eligible individuals undergo such screening in the United States3,4. The availability of blood-based tests could increase screening uptake. Here we introduce improvements to cancer personalized profiling by deep sequencing (CAPP-Seq)5, a method for the analysis of circulating tumour DNA (ctDNA), to better facilitate screening applications. We show that, although levels are very low in early-stage lung cancers, ctDNA is present prior to treatment in most patients and its presence is strongly prognostic. We also find that the majority of somatic mutations in the cell-free DNA (cfDNA) of patients with lung cancer and of risk-matched controls reflect clonal haematopoiesis and are non-recurrent. Compared with tumour-derived mutations, clonal haematopoiesis mutations occur on longer cfDNA fragments and lack mutational signatures that are associated with tobacco smoking. Integrating these findings with other molecular features, we develop and prospectively validate a machine-learning method termed 'lung cancer likelihood in plasma' (Lung-CLiP), which can robustly discriminate early-stage lung cancer patients from risk-matched controls. This approach achieves performance similar to that of tumour-informed ctDNA detection and enables tuning of assay specificity in order to facilitate distinct clinical applications. Our findings establish the potential of cfDNA for lung cancer screening and highlight the importance of risk-matching cases and controls in cfDNA-based screening studies.
View details for DOI 10.1038/s41586-020-2140-0
View details for PubMedID 32269342
-
Integrating genomic features for non-invasive early lung cancer detection
NATURE
2020
View details for DOI 10.1038/s41586-020-2140-0
View details for Web of Science ID 000521531000011
-
Post model-fitting exploration via a "Next-Door" analysis
CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE
2020
View details for DOI 10.1002/cjs.11542
View details for Web of Science ID 000561683100001
-
Dose-related Allergic Reactions Decrease Over Time During Peanut Oral Immunotherapy in a Large, Randomized, Double-blind, Placebo-controlled, Phase 2 Study
MOSBY-ELSEVIER. 2020: AB134
View details for Web of Science ID 000517092700425
-
Sustained outcomes in oral immunotherapy for peanut allergy (POISED study): a large, randomised, double-blind, placebo-controlled, phase 2 study
MOSBY-ELSEVIER. 2020: AB181
View details for Web of Science ID 000517092700578
-
A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank.
PLoS genetics
2020; 16 (10): e1009141
Abstract
The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso, since its first proposal in statistics, has proved to be an effective method for simultaneous variable selection and estimation. However, the large-scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports ℓ1-penalized linear model, logistic regression, Cox model, and also extends to the elastic net with ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
View details for DOI 10.1371/journal.pgen.1009141
View details for PubMedID 33095761
-
Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank.
Biostatistics (Oxford, England)
2020
Abstract
We develop a scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the $L^1$-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do not fit in the memory. The output of our algorithm is the full Lasso path, the parameter estimates at all predefined regularization parameters, as well as their validation accuracy measured using the concordance index (C-index) or the validation deviance. To demonstrate the effectiveness of our algorithm, we analyze a large genotype-survival time dataset across 306 disease outcomes from the UK Biobank (Sudlow and others, 2015). We provide a publicly available implementation of the proposed approach for genetics data on top of the PLINK2 package and name it snpnet-Cox.
View details for DOI 10.1093/biostatistics/kxaa038
View details for PubMedID 32989444
-
Origins and clonal convergence of gastrointestinal IgE+ B cells in human peanut allergy.
Science immunology
2020; 5 (45)
Abstract
B cells in human food allergy have been studied predominantly in the blood. Little is known about IgE+ B cells or plasma cells in tissues exposed to dietary antigens. We characterized IgE+ clones in blood, stomach, duodenum, and esophagus of 19 peanut-allergic patients, using high-throughput DNA sequencing. IgE+ cells in allergic patients are enriched in stomach and duodenum, and have a plasma cell phenotype. Clonally related IgE+ and non-IgE-expressing cell frequencies in tissues suggest local isotype switching, including transitions between IgA and IgE isotypes. Highly similar antibody sequences specific for peanut allergen Ara h 2 are shared between patients, indicating that common immunoglobulin genetic rearrangements may contribute to pathogenesis. These data define the gastrointestinal tract as a reservoir of IgE+ B lineage cells in food allergy.
View details for DOI 10.1126/sciimmunol.aay4209
View details for PubMedID 32139586
-
Increased diversity of gut microbiota during active oral immunotherapy in peanut allergic adults.
Allergy
2020
View details for DOI 10.1111/all.14540
View details for PubMedID 32750160
-
Defining the features and duration of antibody responses to SARS-CoV-2 infection associated with disease severity and outcome.
Science immunology
2020; 5 (54)
Abstract
SARS-CoV-2-specific antibodies, particularly those preventing viral spike receptor binding domain (RBD) interaction with host angiotensin-converting enzyme 2 (ACE2) receptor, can neutralize the virus. It is, however, unknown which features of the serological response may affect clinical outcomes of COVID-19 patients. We analyzed 983 longitudinal plasma samples from 79 hospitalized COVID-19 patients and 175 SARS-CoV-2-infected outpatients and asymptomatic individuals. Within this cohort, 25 patients died of their illness. Higher ratios of IgG antibodies targeting S1 or RBD domains of spike compared to nucleocapsid antigen were seen in outpatients who had mild illness versus severely ill patients. Plasma antibody increases correlated with decreases in viral RNAemia, but antibody responses in acute illness were insufficient to predict inpatient outcomes. Pseudovirus neutralization assays and a scalable ELISA measuring antibodies blocking RBD-ACE2 interaction were well correlated with patient IgG titers to RBD. Outpatient and asymptomatic individuals' SARS-CoV-2 antibodies, including IgG, progressively decreased during observation up to five months post-infection.
View details for DOI 10.1126/sciimmunol.abe0240
View details for PubMedID 33288645
-
Identification of Diagnostic Metabolic Signatures in Clear Cell Renal Cell Carcinoma Using Mass Spectrometry Imaging.
International journal of cancer
2019
Abstract
Clear cell renal cell carcinoma (ccRCC) is the most common and lethal subtype of kidney cancer. Intraoperative frozen section (IFS) analysis is used to confirm the diagnosis during partial nephrectomy (PN). However, surgical margin evaluation using IFS analysis is time consuming and unreliable, leading to relatively low utilization. In this study, we demonstrated the use of desorption electrospray ionization mass spectrometry imaging (DESI-MSI) as a molecular diagnostic and prognostic tool for ccRCC. DESI-MSI was conducted on 23 fresh-frozen normal-tumor paired nephrectomy specimens of ccRCC. An independent validation cohort of 17 normal-tumor pairs was analyzed. DESI-MSI provides two-dimensional molecular images of tissues with mass spectra representing small metabolites, fatty acids, and lipids. These tissues were subjected to histopathologic evaluation. A set of metabolites that distinguish ccRCC from normal kidney were identified by performing least absolute shrinkage and selection operator (Lasso) and log-ratio Lasso analysis. Lasso analysis with leave-one-patient-out cross validation selected 57 peaks from over 27,000 metabolic features across 37,608 pixels obtained using DESI-MSI of ccRCC and normal tissues. Baseline Lasso of metabolites predicted the class of each tissue to be normal or cancerous tissue with an accuracy of 94% and 76%, respectively. Combining the baseline Lasso with the ratio of glucose to arachidonic acid could potentially reduce scan time and improve accuracy to identify normal (82%) and ccRCC (88%) tissue. DESI-MSI allows rapid detection of metabolites associated with normal and ccRCC with high accuracy. As this technology advances, it could be used for rapid intraoperative assessment of surgical margin status.
View details for DOI 10.1002/ijc.32843
View details for PubMedID 31863456
-
Increased T Cell Differentiation and Cytolytic Function in Bangladeshi Compared to American Children.
Frontiers in immunology
2019; 10: 2239
Abstract
During the first 5 years of life, children are especially vulnerable to infection-related morbidity and mortality. Conversely, the Hygiene Hypothesis suggests that a lack of exposure to infectious agents early in life could explain the increasing incidence of allergies and autoimmunity in high-income countries. Understanding these phenomena, however, is hampered by a lack of comprehensive, direct immune monitoring in children with differing degrees of microbial exposure. Using mass cytometry, we provide an in-depth profile of the peripheral blood mononuclear cells (PBMCs) of children in regions at the extremes of exposure: the San Francisco Bay Area, USA and an economically poor district of Dhaka, Bangladesh. Despite variability in clinical health, functional characteristics of PBMCs were similar in Bangladeshi and American children at 1 year of age. However, by 2-3 years of age, Bangladeshi children's immune cells often demonstrated altered activation and cytokine production profiles upon stimulation with PMA-ionomycin, with an overall immune trajectory more in line with American adults. Conversely, immune responses in children from the US remained steady. Using principal component analysis, donor location, ethnic background, and cytomegalovirus infection status were found to account for some of the variation identified among samples. Within Bangladeshi 1-year-olds, stunting (as measured by height-for-age z-scores) was found to be associated with IL-8 and TGFβ expression in PMA-ionomycin stimulated samples. Combined, these findings provide important insights into the immune systems of children in high vs. low microbial exposure environments and suggest an important role for IL-8 and TGFβ in mitigating the microbial challenges faced by the Bangladeshi children.
View details for DOI 10.3389/fimmu.2019.02239
View details for PubMedID 31620139
View details for PubMedCentralID PMC6763580
-
Increased T Cell Differentiation and Cytolytic Function in Bangladeshi Compared to American Children
FRONTIERS IN IMMUNOLOGY
2019; 10
View details for DOI 10.3389/fimmu.2019.02239
View details for Web of Science ID 000487188500001
-
A Pliable Lasso
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2019
View details for DOI 10.1080/10618600.2019.1648271
View details for Web of Science ID 000485888100001
-
Early detection of unilateral ureteral obstruction by desorption electrospray ionization mass spectrometry.
Scientific reports
2019; 9 (1): 11007
Abstract
Desorption electrospray ionization mass spectrometry (DESI-MS) is an emerging analytical tool for rapid in situ assessment of metabolomic profiles on tissue sections without tissue pretreatment or labeling. We applied DESI-MS to identify candidate metabolic biomarkers associated with kidney injury at the early stage. DESI-MS was performed on sections of kidneys from 80 mice over a time course following unilateral ureteral obstruction (UUO) and compared to sham controls. A predictive model of renal damage was constructed using the LASSO (least absolute shrinkage and selection operator) method. Levels of lipid and small metabolites were significantly altered and glycerophospholipids comprised a significant fraction of altered species. These changes correlate with altered expression of lipid metabolic genes, with most genes showing decreased expression. However, rapid upregulation of PG(22:6/22:6) level appeared to be a hitherto unknown feature of the metabolic shift observed in UUO. Using LASSO and SAM (significance analysis of microarrays), we identified a set of well-measured metabolites that accurately predicted UUO-induced renal damage that was detectable by 12h after UUO, prior to apparent histological changes. Thus, DESI-MS could serve as a useful adjunct to histology in identifying renal damage and demonstrates early and broad changes in membrane associated lipids.
View details for DOI 10.1038/s41598-019-47396-x
View details for PubMedID 31358807
-
Dynamic Risk Profiling Using Serial Tumor Biomarkers for Personalized Outcome Prediction.
Cell
2019
Abstract
Accurate prediction of long-term outcomes remains a challenge in the care of cancer patients. Due to the difficulty of serial tumor sampling, previous prediction tools have focused on pretreatment factors. However, emerging non-invasive diagnostics have increased opportunities for serial tumor assessments. We describe the Continuous Individualized Risk Index (CIRI), a method to dynamically determine outcome probabilities for individual patients utilizing risk predictors acquired over time. Similar to "win probability" models in other fields, CIRI provides a real-time probability by integrating risk assessments throughout a patient's course. Applying CIRI to patients with diffuse large B cell lymphoma, we demonstrate improved outcome prediction compared to conventional risk models. We demonstrate CIRI's broader utility in analogous models of chronic lymphocytic leukemia and breast adenocarcinoma and perform a proof-of-concept analysis demonstrating how CIRI could be used to develop predictive biomarkers for therapy selection. We envision that dynamic risk assessment will facilitate personalized medicine and enable innovative therapeutic paradigms.
View details for DOI 10.1016/j.cell.2019.06.011
View details for PubMedID 31280963
-
Main Effects and Interactions in Mixed and Incomplete Data Frames
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2019
View details for DOI 10.1080/01621459.2019.1623041
View details for Web of Science ID 000475141200001
-
Log-ratio lasso: Scalable, sparse estimation for log-ratio models
BIOMETRICS
2019; 75 (2): 613–24
View details for DOI 10.1111/biom.12995
View details for Web of Science ID 000483730600028
-
Proliferation tracing with single-cell mass cytometry optimizes generation of stem cell memory-like T cells
NATURE BIOTECHNOLOGY
2019; 37 (3): 259-+
View details for DOI 10.1038/s41587-019-0033-2
View details for Web of Science ID 000460155900016
-
Shaping of infant B cell receptor repertoires by environmental factors and infectious disease.
Science translational medicine
2019; 11 (481)
Abstract
Antigenic exposures at epithelial sites in infancy and early childhood are thought to influence the maturation of humoral immunity and modulate the risk of developing immunoglobulin E (IgE)-mediated allergic disease. How different kinds of environmental exposures influence B cell isotype switching to IgE, IgG, or IgA, and the somatic mutation maturation of these antibody pools, is not fully understood. We sequenced antibody repertoires in longitudinal blood samples in a birth cohort from infancy through the first 3 years of life and found that, whereas IgG and IgA show linear increases in mutational maturation with age, IgM and IgD mutations are more closely tied to pathogen exposure. IgE mutation frequencies are primarily increased in children with impaired skin barrier conditions such as eczema, suggesting that IgE affinity maturation could provide a mechanistic link between epithelial barrier failure and allergy development.
View details for PubMedID 30814336
-
Reply to J. Wang et al.
Journal of clinical oncology : official journal of the American Society of Clinical Oncology
2019: JCO1801907
View details for PubMedID 30753108
-
Proliferation tracing with single-cell mass cytometry optimizes generation of stem cell memory-like T cells.
Nature biotechnology
2019
Abstract
Selective differentiation of naive T cells into multipotent T cells is of great interest clinically for the generation of cell-based cancer immunotherapies. Cellular differentiation depends crucially on division state and time. Here we adapt a dye dilution assay for tracking cell proliferative history through mass cytometry and uncouple division, time and regulatory protein expression in single naive human T cells during their activation and expansion in a complex ex vivo milieu. Using 23 markers, we defined groups of proteins controlled predominantly by division state or time and found that undivided cells account for the majority of phenotypic diversity. We next built a map of cell state changes during naive T-cell expansion. By examining cell signaling on this map, we rationally selected ibrutinib, a BTK and ITK inhibitor, and administered it before T cell activation to direct differentiation toward a T stem cell memory (TSCM)-like phenotype. This method for tracing cell fate across division states and time can be broadly applied for directing cellular differentiation.
View details for PubMedID 30742126
-
Desensitization rates to peanut protein during OIT among children, adolescents, and adults
MOSBY-ELSEVIER. 2019: AB245
View details for DOI 10.1016/j.jaci.2018.12.750
View details for Web of Science ID 000457771200738
-
Multiomics modeling of the immunome, transcriptome, microbiome, proteome and metabolome adaptations during human pregnancy.
Bioinformatics (Oxford, England)
2019; 35 (1): 95–103
Abstract
Motivation: Multiple biological clocks govern a healthy pregnancy. These biological mechanisms produce immunologic, metabolomic, proteomic, genomic and microbiomic adaptations during the course of pregnancy. Modeling the chronology of these adaptations during full-term pregnancy provides the frameworks for future studies examining deviations implicated in pregnancy-related pathologies including preterm birth and preeclampsia. Results: We performed a multiomics analysis of 51 samples from 17 pregnant women, delivering at term. The datasets included measurements from the immunome, transcriptome, microbiome, proteome and metabolome of samples obtained simultaneously from the same patients. Multivariate predictive modeling using the Elastic Net (EN) algorithm was used to measure the ability of each dataset to predict gestational age. Using stacked generalization, these datasets were combined into a single model. This model not only significantly increased predictive power by combining all datasets, but also revealed novel interactions between different biological modalities. Future work includes expansion of the cohort to preterm-enriched populations and in vivo analysis of immune-modulating interventions based on the mechanisms identified. Availability and implementation: Datasets and scripts for reproduction of results are available through: https://nalab.stanford.edu/multiomics-pregnancy/. Supplementary information: Supplementary data are available at Bioinformatics online.
View details for PubMedID 30561547
-
Mapping lung cancer epithelial-mesenchymal transition states and trajectories with single-cell resolution.
Nature communications
2019; 10 (1): 5587
Abstract
Elucidating the spectrum of epithelial-mesenchymal transition (EMT) and mesenchymal-epithelial transition (MET) states in clinical samples promises insights on cancer progression and drug resistance. Using mass cytometry time-course analysis, we resolve lung cancer EMT states through TGFβ-treatment and identify, through TGFβ-withdrawal, a distinct MET state. We demonstrate significant differences between EMT and MET trajectories using a computational tool (TRACER) for reconstructing trajectories between cell states. In addition, we construct a lung cancer reference map of EMT and MET states referred to as the EMT-MET PHENOtypic STAte MaP (PHENOSTAMP). Using a neural net algorithm, we project clinical samples onto the EMT-MET PHENOSTAMP to characterize their phenotypic profile with single-cell resolution in terms of our in vitro EMT-MET analysis. In summary, we provide a framework to phenotypically characterize clinical samples in the context of in vitro EMT-MET findings which could help assess clinical relevance of EMT in cancer in future studies.
View details for DOI 10.1038/s41467-019-13441-6
View details for PubMedID 31811131
-
Sustained outcomes in oral immunotherapy for peanut allergy (POISED study): a large, randomised, double-blind, placebo-controlled, phase 2 study.
Lancet (London, England)
2019
Abstract
Dietary avoidance is recommended for peanut allergies. We evaluated the sustained effects of peanut allergy oral immunotherapy (OIT) in a randomised long-term study in adults and children. In this randomised, double-blind, placebo-controlled, phase 2 study, we enrolled participants at the Sean N Parker Center for Allergy and Asthma Research at Stanford University (Stanford, CA, USA) with peanut allergy aged 7-55 years with a positive result from a double-blind, placebo-controlled, food challenge (DBPCFC; ≤500 mg of peanut protein), a positive skin-prick test (SPT) result (≥5 mm wheal diameter above the negative control), and peanut-specific immunoglobulin (Ig)E concentration of more than 4 kU/L. Participants were randomly assigned (2·4:1·4:1) in a two-by-two block design via a computerised system to be built up and maintained on 4000 mg peanut protein through to week 104 then discontinued on peanut (peanut-0 group), to be built up and maintained on 4000 mg peanut protein through to week 104 then to ingest 300 mg peanut protein daily (peanut-300 group) for 52 weeks, or to receive oat flour (placebo group). DBPCFCs to 4000 mg peanut protein were done at baseline and weeks 104, 117, 130, 143, and 156. The pharmacist assigned treatment on the basis of a randomised computer list. Peanut or placebo (oat) flour was administered orally and participants and the study team were masked throughout by use of oat flour that was similar in look and feel to the peanut flour and nose clips, as tolerated, to mask taste. The statistician was also masked. The primary endpoint was the proportion of participants who passed DBPCFCs to a cumulative dose of 4000 mg at both 104 and 117 weeks. The primary efficacy analysis was done in the intention-to-treat population. Safety was assessed in the intention-to-treat population. This trial is registered at ClinicalTrials.gov, NCT02103270. Between April 15, 2014, and March 2, 2016, of 152 individuals assessed, we enrolled 120 participants, who were randomly assigned to the peanut-0 (n=60), peanut-300 (n=35), and placebo groups (n=25). 21 (35%) of peanut-0 group participants and one (4%) placebo group participant passed the 4000 mg challenge at both 104 and 117 weeks (odds ratio [OR] 12·7, 95% CI 1·8-554·8; p=0·0024). Over the entire study, the most common adverse events were mild gastrointestinal symptoms, which were seen in 90 of 120 patients (50/60 in the peanut-0 group, 29/35 in the peanut-300 group, and 11/25 in the placebo group) and skin disorders, which were seen in 50/120 patients (26/60 in the peanut-0 group, 15/35 in the peanut-300 group, and 9/25 in the placebo group). Adverse events decreased over time in all groups. Two participants in the peanut groups had serious adverse events during the 3-year study. In the peanut-0 group, in which eight (13%) of 60 participants passed DBPCFCs at week 156, higher baseline peanut-specific IgG4 to IgE ratio and lower Ara h 2 IgE and basophil activation responses were associated with sustained unresponsiveness. No treatment-related deaths occurred. Our study suggests that peanut OIT could desensitise individuals with peanut allergy to 4000 mg peanut protein but discontinuation, or even reduction to 300 mg daily, could increase the likelihood of regaining clinical reactivity to peanut. Since baseline blood tests correlated with week 117 treatment outcomes, this study might aid in optimal patient selection for this therapy. Funding: National Institute of Allergy and Infectious Diseases.
View details for DOI 10.1016/S0140-6736(19)31793-3
View details for PubMedID 31522849
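As a rough check on the primary-endpoint figures in the abstract above, the following Python sketch recomputes the pass rates and the sample odds ratio from the reported counts (21 of 60 in the peanut-0 group, 1 of 25 in the placebo group) and runs a two-sided Fisher exact test. The function and variable names are illustrative, and the use of Fisher's test is an assumption: the abstract does not say which procedure produced the reported odds ratio and p-value. The plain sample odds ratio computed here need not equal the reported 12·7, which may come from a different estimator.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c

    def weight(k):
        # Unnormalised integer weight; keeps the comparison exact.
        return comb(row1, k) * comb(row2, col1 - k)

    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    extreme = sum(weight(k) for k in range(k_min, k_max + 1)
                  if weight(k) <= observed)
    return extreme / comb(row1 + row2, col1)

# Counts taken from the abstract (passed the 4000 mg challenge at both visits).
passed_peanut0, n_peanut0 = 21, 60
passed_placebo, n_placebo = 1, 25

a, b = passed_peanut0, n_peanut0 - passed_peanut0   # peanut-0: pass, fail
c, d = passed_placebo, n_placebo - passed_placebo   # placebo: pass, fail

rate_peanut0 = a / n_peanut0
rate_placebo = c / n_placebo
sample_or = (a * d) / (b * c)        # cross-product (sample) odds ratio
p_value = fisher_two_sided(a, b, c, d)

print(f"pass rate, peanut-0: {rate_peanut0:.0%}")
print(f"pass rate, placebo:  {rate_placebo:.0%}")
print(f"sample odds ratio:   {sample_or:.1f}")
print(f"Fisher two-sided p:  {p_value:.4f}")
```

The integer weights avoid floating-point ties when deciding which tables count as "at least as extreme" as the observed one.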
-
Preoperative metabolic classification of thyroid nodules using mass spectrometry imaging of fine-needle aspiration biopsies.
Proceedings of the National Academy of Sciences of the United States of America
2019
Abstract
Thyroid neoplasia is common and requires appropriate clinical workup with imaging and fine-needle aspiration (FNA) biopsy to evaluate for cancer. Yet, up to 20% of thyroid nodule FNA biopsies will be indeterminate in diagnosis based on cytological evaluation. Genomic approaches to characterize the malignant potential of nodules showed initial promise but have provided only modest improvement in diagnosis. Here, we describe a method using metabolic analysis by desorption electrospray ionization mass spectrometry (DESI-MS) imaging for direct analysis and diagnosis of follicular cell-derived neoplasia tissues and FNA biopsies. DESI-MS was used to analyze 178 tissue samples to determine the molecular signatures of normal, benign follicular adenoma (FTA), and malignant follicular carcinoma (FTC) and papillary carcinoma (PTC) thyroid tissues. Statistical classifiers, including benign thyroid versus PTC and benign thyroid versus FTC, were built and validated with 114,125 mass spectra, with accuracy assessed in correlation with clinical pathology. Clinical FNA smears were prospectively collected and analyzed using DESI-MS imaging, and the performance of the statistical classifiers was tested with 69 prospectively collected clinical FNA smears. High performance was achieved for both models when predicting on the FNA test set, which included 24 nodules with indeterminate preoperative cytology, with accuracies of 93% and 89%. Our results strongly suggest that DESI-MS imaging is a valuable technology for identification of malignant potential of thyroid nodules.
View details for DOI 10.1073/pnas.1911333116
View details for PubMedID 31591199
-
An Approach to Explore for a Sweet-spot in Randomized Trials.
Journal of clinical epidemiology
2019
Abstract
To demonstrate how a conventional randomized trial can be analyzed through a stratified or a matched approach to identify a potential sweet-spot where observed differences seem accentuated in the mid range of disease severity. Study Design and Setting: We review a landmark randomized trial of heart failure patients that tested whether implantable defibrillators reduce mortality (n = 2,521). Overall, 22% (182 / 829) of the patients in the defibrillator group died compared to 29% (484 / 1,692) of patients in the control group. Proportional hazards analysis yielded a modest 25% survival benefit (hazard ratio = 0.75, 95% confidence interval: 0.63 to 0.89). Stratified analysis of the trial yielded a larger 52% survival benefit for those in the middle quintile of disease severity (hazard ratio = 0.48, 95% confidence interval: 0.29 to 0.79). In contrast, little of the survival benefit was explained by patients with the greatest disease severity (hazard ratio = 0.89, 95% confidence interval 0.69 to 1.15). The discrepancy between crude and stratified analyses could be visualized by graphical displays and replicated with matched comparisons. Our approach for analyzing a randomized trial could help identify a potential sweet-spot of an accentuated treatment effect.
View details for DOI 10.1016/j.jclinepi.2019.12.012
View details for PubMedID 31874202
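The crude mortality figures quoted in the abstract above can be reproduced with a few lines of Python. This is only an arithmetic sketch with illustrative variable names: it yields a crude risk ratio and an absolute risk difference, not the hazard ratio reported in the abstract, which comes from a time-to-event (proportional hazards) analysis that cannot be rebuilt from these counts.

```python
# Death counts and group sizes as reported in the abstract.
deaths_defib, n_defib = 182, 829
deaths_control, n_control = 484, 1692

# The two arms should add up to the stated trial size.
assert n_defib + n_control == 2521

risk_defib = deaths_defib / n_defib
risk_control = deaths_control / n_control

crude_risk_ratio = risk_defib / risk_control
absolute_risk_difference = risk_control - risk_defib

print(f"mortality, defibrillator: {risk_defib:.1%}")
print(f"mortality, control:       {risk_control:.1%}")
print(f"crude risk ratio:         {crude_risk_ratio:.2f}")
print(f"absolute risk difference: {absolute_risk_difference:.1%}")
```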
-
Genomic analysis of benign prostatic hyperplasia implicates cellular re-landscaping in disease pathogenesis.
JCI insight
2019; 5
Abstract
Benign prostatic hyperplasia (BPH) is the most common cause of lower urinary tract symptoms in men. Current treatments target prostate physiology rather than BPH pathophysiology and are only partially effective. Here, we applied next-generation sequencing to gain new insight into BPH. By RNAseq, we uncovered transcriptional heterogeneity among BPH cases, where a 65-gene BPH stromal signature correlated with symptom severity. Stromal signaling molecules BMP5 and CXCL13 were enriched in BPH while estrogen regulated pathways were depleted. Notably, BMP5 addition to cultured prostatic myofibroblasts altered their expression profile towards a BPH profile that included the BPH stromal signature. RNAseq also suggested an altered cellular milieu in BPH, which we verified by immunohistochemistry and single-cell RNAseq. In particular, BPH tissues exhibited enrichment of myofibroblast subsets, whilst depletion of neuroendocrine cells and an estrogen receptor (ESR1)-positive fibroblast cell type residing near epithelium. By whole-exome sequencing, we uncovered somatic single-nucleotide variants (SNVs) in BPH, of uncertain pathogenic significance but indicative of clonal cell expansions. Thus, genomic characterization of BPH has identified a clinically-relevant stromal signature and new candidate disease pathways (including a likely role for BMP5 signaling), and reveals BPH to be not merely a hyperplasia, but rather a fundamental re-landscaping of cell types.
View details for DOI 10.1172/jci.insight.129749
View details for PubMedID 31094703
-
Multiomics modeling of the immunome, transcriptome, microbiome, proteome and metabolome adaptations during human pregnancy
BIOINFORMATICS
2019; 35 (1): 95–103
View details for DOI 10.1093/bioinformatics/bty537
View details for Web of Science ID 000459313900012
-
Nuclear penalized multinomial regression with an application to predicting at bat outcomes in baseball
STATISTICAL MODELLING
2018; 18 (5-6): 388–410
View details for DOI 10.1177/1471082X18777669
View details for Web of Science ID 000452266900002
-
Found In Translation: a machine learning model for mouse-to-human inference.
Nature methods
2018
Abstract
Cross-species differences form barriers to translational research that ultimately hinder the success of clinical trials, yet knowledge of species differences has yet to be systematically incorporated in the interpretation of animal models. Here we present Found In Translation (FIT; http://www.mouse2man.org), a statistical methodology that leverages public gene expression data to extrapolate the results of a new mouse experiment to expression changes in the equivalent human condition. We applied FIT to data from mouse models of 28 different human diseases and identified experimental conditions in which FIT predictions outperformed direct cross-species extrapolation from mouse results, increasing the overlap of differentially expressed genes by 20-50%. FIT predicted novel disease-associated genes, an example of which we validated experimentally. FIT highlights signals that may otherwise be missed and reduces false leads, with no experimental cost.
View details for PubMedID 30478323
-
Analyzing Excess Risk from Matched Designs with Double Controls: Author response.
Journal of clinical epidemiology
2018
View details for PubMedID 30453039
-
Log-ratio Lasso: Scalable, Sparse Estimation for Log-ratio Models.
Biometrics
2018
Abstract
Positive-valued signal data is common in the biological and medical sciences, due to the prevalence of mass spectrometry and other imaging techniques. With such data, only the relative intensities of the raw measurements are meaningful. It is desirable to consider models consisting of the log-ratios of all pairs of the raw features, since log-ratios are the simplest meaningful derived features. In this case, however, the dimensionality of the predictor space becomes large, and computationally efficient estimation procedures are required. In this work, we introduce an embedding of the log-ratio parameter space into a space of much lower dimension and use this representation to develop an efficient penalized fitting procedure. This procedure serves as the foundation for a two-step fitting procedure that combines a convex filtering step with a second non-convex pruning step to yield highly sparse solutions. On a cancer proteomics data set, the proposed method fits a highly sparse model consisting of features of known biological relevance while greatly improving upon the predictive accuracy of less interpretable methods.
View details for PubMedID 30387139
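The dimension reduction mentioned in the abstract above can be motivated by a simple algebraic identity, sketched here in LaTeX. The notation is chosen for illustration and is not taken from the paper: $x_1,\dots,x_p$ are the positive raw features, $\theta_{jk}$ are coefficients on the pairwise log-ratios, and $\beta_j$ are the induced coefficients on the individual log features. Whether the paper's embedding has exactly this form is an assumption.

```latex
\[
\sum_{1 \le j < k \le p} \theta_{jk} \,\log\frac{x_j}{x_k}
  \;=\; \sum_{j=1}^{p} \beta_j \log x_j ,
\qquad
\beta_j \;=\; \sum_{k > j} \theta_{jk} \;-\; \sum_{k < j} \theta_{kj},
\qquad
\sum_{j=1}^{p} \beta_j \;=\; 0 .
\]
```

Each $\theta_{jk}$ enters $\beta_j$ with a plus sign and $\beta_k$ with a minus sign, so the $\beta_j$ always sum to zero. A linear model in all $p(p-1)/2$ pairwise log-ratios is therefore equivalent to a linear model in the $p$ log features under a single sum-to-zero constraint.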
-
Multicenter Study Using Desorption-Electrospray-Ionization-Mass-Spectrometry Imaging for Breast-Cancer Diagnosis
ANALYTICAL CHEMISTRY
2018; 90 (19): 11324–32
Abstract
The histological and molecular subtypes of breast cancer demand distinct therapeutic approaches. Invasive ductal carcinoma (IDC) is subtyped according to estrogen-receptor (ER), progesterone-receptor (PR), and HER2 status, among other markers. Desorption-electrospray-ionization-mass-spectrometry imaging (DESI-MSI) is an ambient-ionization MS technique that has been previously used to diagnose IDC. Aiming to investigate the robustness of ambient-ionization MS for IDC diagnosis and subtyping over diverse patient populations and interlaboratory use, we report a multicenter study using DESI-MSI to analyze samples from 103 patients independently analyzed in the United States and Brazil. The lipid profiles of IDC and normal breast tissues were consistent across different patient races and were unrelated to country of sample collection. Similar experimental parameters used in both laboratories yielded consistent mass-spectral data in mass-to-charge ratios (m/z) above 700, where complex lipids are observed. Statistical classifiers built using data acquired in the United States yielded 97.6% sensitivity, 96.7% specificity, and 97.6% accuracy for cancer diagnosis. Equivalent performance was observed for the intralaboratory validation set (99.2% accuracy) and, most remarkably, for the interlaboratory validation set independently acquired in Brazil (95.3% accuracy). Separate classification models built for ER and PR statuses as well as the status of their combined hormone receptor (HR) provided predictive accuracies (>89.0%), although low classification accuracies were achieved for HER2 status. Altogether, our multicenter study demonstrates that DESI-MSI is a robust and reproducible technology for rapid breast-cancer-tissue diagnosis and therefore is of value for clinical use.
View details for PubMedID 30170496
-
Circulating Tumor DNA Measurements As Early Outcome Predictors in Diffuse Large B-Cell Lymphoma.
Journal of clinical oncology : official journal of the American Society of Clinical Oncology
2018: JCO2018785246
Abstract
Purpose: Outcomes for patients with diffuse large B-cell lymphoma remain heterogeneous, with existing methods failing to consistently predict treatment failure. We examined the additional prognostic value of circulating tumor DNA (ctDNA) before and during therapy for predicting patient outcomes. Patients and Methods: We studied the dynamics of ctDNA from 217 patients treated at six centers, using a training and validation framework. We densely characterized early ctDNA dynamics during therapy using cancer personalized profiling by deep sequencing to define response-associated thresholds within a discovery set. These thresholds were assessed in two independent validation sets. Finally, we assessed the prognostic value of ctDNA in the context of established risk factors, including the International Prognostic Index and interim positron emission tomography/computed tomography scans. Results: Before therapy, ctDNA was detectable in 98% of patients; pretreatment levels were prognostic in both front-line and salvage settings. In the discovery set, ctDNA levels changed rapidly, with a 2-log decrease after one cycle (early molecular response [EMR]) and a 2.5-log decrease after two cycles (major molecular response [MMR]) stratifying outcomes. In the first validation set, patients receiving front-line therapy achieving EMR or MMR had superior outcomes at 24 months (EMR: EFS, 83% v 50%; P = .0015; MMR: EFS, 82% v 46%; P < .001). EMR also predicted superior 24-month outcomes in patients receiving salvage therapy in the first validation set (EFS, 100% v 13%; P = .011). The prognostic value of EMR and MMR was further confirmed in the second validation set. In multivariable analyses including International Prognostic Index and interim positron emission tomography/computed tomography scans across both cohorts, molecular response was independently prognostic of outcomes, including event-free and overall survival. Conclusion: Pretreatment ctDNA levels and molecular responses are independently prognostic of outcomes in aggressive lymphomas. These risk factors could potentially guide future personalized risk-directed approaches.
View details for PubMedID 30125215
-
Development of plasma cell-free DNA (cfDNA) assays for early cancer detection: first insights from the Circulating Cell-Free Genome Atlas Study (CCGA)
AMER ASSOC CANCER RESEARCH. 2018
View details for DOI 10.1158/1538-7445.AM2018-LB-343
View details for Web of Science ID 000468818900480
-
Supervised learning via the "hubNet" procedure.
Statistica Sinica
2018; 28 (3): 1225-1243
Abstract
We propose a new method for supervised learning. The hubNet procedure fits a hub-based graphical model to the predictors, to estimate the amount of "connection" that each predictor has with other predictors. This yields a set of predictor weights that are then used in a regularized regression such as the lasso or elastic net. The resulting procedure is easy to implement, can often yield higher or competitive prediction accuracy with fewer features than the lasso, and can give insight into the underlying structure of the predictors. HubNet can be generalized seamlessly to supervised problems such as regularized logistic regression (and other GLMs), Cox's proportional hazards model, and nonlinear procedures such as random forests and boosting. We prove recovery results under a specialized model and illustrate the method on real and simulated data.
View details for DOI 10.5705/ss.202016.0482
View details for PubMedID 35677806
View details for PubMedCentralID PMC9173714
-
SUPERVISED LEARNING VIA THE "HUBNET" PROCEDURE
STATISTICA SINICA
2018; 28 (3): 1225–43
View details for DOI 10.5705/ss.202016.0482
View details for Web of Science ID 000450215100006
-
Pharmacogenetics and progression to neovascular age-related macular degeneration - Evidence supporting practice change REPLY
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2018; 115 (25): E5640–E5641
View details for PubMedID 29880713
-
Noninvasive blood tests for fetal development predict gestational age and preterm delivery
SCIENCE
2018; 360 (6393): 1133–36
Abstract
Noninvasive blood tests that provide information about fetal development and gestational age could potentially improve prenatal care. Ultrasound, the current gold standard, is not always affordable in low-resource settings and does not predict spontaneous preterm birth, a leading cause of infant death. In a pilot study of 31 healthy pregnant women, we found that measurement of nine cell-free RNA (cfRNA) transcripts in maternal blood predicted gestational age with comparable accuracy to ultrasound but at substantially lower cost. In a related study of 38 women (23 full-term and 15 preterm deliveries), all at elevated risk of delivering preterm, we identified seven cfRNA transcripts that accurately classified women who delivered preterm up to 2 months in advance of labor. These tests hold promise for prenatal care in both the developed and developing worlds, although they require validation in larger, blinded clinical trials.
View details for PubMedID 29880692
-
Methods for analyzing matched designs with double controls: excess risk is easily estimated and misinterpreted when evaluating traffic deaths
JOURNAL OF CLINICAL EPIDEMIOLOGY
2018; 98: 117–22
Abstract
To demonstrate analytic approaches for matched studies where two controls are linked to each case and events are accumulating counts rather than binary outcomes. A secondary intent is to clarify the distinction between total risk and excess risk (unmatched vs. matched perspectives). We review past research testing whether elections can lead to increased traffic risks. The results are reinterpreted by analyzing both the total count of individuals in fatal crashes and the excess count of individuals in fatal crashes, each time accounting for the matched double controls. Overall, 1,546 individuals were in fatal crashes on the 10 election days (average = 155/d), and 2,593 individuals were in fatal crashes on the 20 control days (average = 130/d). Poisson regression of total counts yielded a relative risk of 1.19 (95% confidence interval: 1.12-1.27). Poisson regression of excess counts yielded a relative risk of 3.22 (95% confidence interval: 2.72-3.80). The discrepancy between analyses of total counts and excess counts replicated with alternative statistical models and was visualized in graphical displays. Available approaches provide methods for analyzing count data in matched designs with double controls and help clarify the distinction between increases in total risk and increases in excess risk.
View details for PubMedID 29452220
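The total-count result in the abstract above can be approximated from the reported totals alone. The Python sketch below computes the daily averages, the rate ratio, and a Wald 95% interval on the log scale, which is what a Poisson model with a single election-day indicator would give. It is a simplified stand-in, not the authors' analysis: it ignores the matching structure, and the function name and the normal quantile of 1.96 are choices made for illustration. The excess-count analysis depends on details not given in the abstract and is not attempted here.

```python
from math import exp, log, sqrt

def rate_ratio_wald(events_a, days_a, events_b, days_b, z=1.96):
    """Rate ratio of group A to group B with a Wald interval on the log scale.

    For Poisson counts the standard error of the log rate ratio is
    sqrt(1/events_a + 1/events_b).
    """
    rate_a = events_a / days_a
    rate_b = events_b / days_b
    ratio = rate_a / rate_b
    se_log = sqrt(1 / events_a + 1 / events_b)
    lower = exp(log(ratio) - z * se_log)
    upper = exp(log(ratio) + z * se_log)
    return rate_a, rate_b, ratio, lower, upper

# Totals reported in the abstract: election days versus control days.
rate_election, rate_control, ratio, lower, upper = rate_ratio_wald(
    events_a=1546, days_a=10, events_b=2593, days_b=20
)

print(f"election days: {rate_election:.0f} per day")
print(f"control days:  {rate_control:.0f} per day")
print(f"rate ratio:    {ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```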
-
Some methods for heterogeneous treatment effect estimation in high dimensions
STATISTICS IN MEDICINE
2018; 37 (11): 1767–87
Abstract
When devising a course of treatment for a patient, doctors often have little quantitative evidence on which to base their decisions, beyond their medical education and published clinical trials. Stanford Health Care alone has millions of electronic medical records that are only just recently being leveraged to inform better treatment recommendations. These data present a unique challenge because they are high dimensional and observational. Our goal is to make personalized treatment recommendations based on the outcomes for past patients similar to a new patient. We propose and analyze 3 methods for estimating heterogeneous treatment effects using observational data. Our methods perform well in simulations using a wide variety of treatment effect functions, and we present results of applying the 2 most promising methods to data from The SPRINT Data Analysis Challenge, from a large randomized trial of a treatment for high blood pressure.
View details for PubMedID 29508417
View details for PubMedCentralID PMC5938172
-
Single-cell developmental classification of B cell precursor acute lymphoblastic leukemia at diagnosis reveals predictors of relapse.
Nature medicine
2018; 24 (4): 474–83
Abstract
Insight into the cancer cell populations that are responsible for relapsed disease is needed to improve outcomes. Here we report a single-cell-based study of B cell precursor acute lymphoblastic leukemia at diagnosis that reveals hidden developmentally dependent cell signaling states that are uniquely associated with relapse. By using mass cytometry we simultaneously quantified 35 proteins involved in B cell development in 60 primary diagnostic samples. Each leukemia cell was then matched to its nearest healthy B cell population by a developmental classifier that operated at the single-cell level. Machine learning identified six features of expanded leukemic populations that were sufficient to predict patient relapse at diagnosis. These features implicated the pro-BII subpopulation of B cells with activated mTOR signaling, and the pre-BI subpopulation of B cells with activated and unresponsive pre-B cell receptor signaling, to be associated with relapse. This model, termed 'developmentally dependent predictor of relapse' (DDPR), significantly improves currently established risk stratification methods. DDPR features exist at diagnosis and persist at relapse. By leveraging a data-driven approach, we demonstrate the predictive value of single-cell 'omics' for patient stratification in a translational setting and provide a framework for its application to human cancer.
View details for PubMedID 29505032
-
Post-selection inference for ℓ1-penalized likelihood models
CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE
2018; 46 (1): 41–61
View details for DOI 10.1002/cjs.11313
View details for Web of Science ID 000425130100004
-
Post-Selection Inference for ℓ1-Penalized Likelihood Models.
The Canadian journal of statistics = Revue canadienne de statistique
2018; 46 (1): 41-61
Abstract
We present a new method for post-selection inference for ℓ1 (lasso)-penalized likelihood models, including generalized regression models. Our approach generalizes the post-selection framework presented in Lee et al. (2013). The method provides p-values and confidence intervals that are asymptotically valid, conditional on the inherent selection done by the lasso. We present applications of this work to (regularized) logistic regression, Cox's proportional hazards model and the graphical lasso. We do not provide rigorous proofs here of the claimed results, but rather conceptual and theoretical sketches.
View details for DOI 10.1002/cjs.11313
View details for PubMedID 30127543
View details for PubMedCentralID PMC6097808
-
Genomic Feature Selection by Coverage Design Optimization.
Journal of applied statistics
2018; 45 (14): 2658-2676
Abstract
We introduce a novel data reduction technique whereby we select a subset of tiles to "cover" maximally events of interest in large-scale biological datasets (e.g., genetic mutations), while minimizing the number of tiles. A tile is a genomic unit capturing one or more biological events, such as a sequence of base pairs that can be sequenced and observed simultaneously. The goal is to reduce significantly the number of tiles considered to those with areas of dense events in a cohort, thus saving on cost and enhancing interpretability. However, the reduction should not come at the cost of too much information, allowing for sensible statistical analysis after its application. We envisage application of our methods to a variety of high throughput data types, particularly those produced by next generation sequencing (NGS) experiments. The procedure is cast as a convex optimization problem, which is presented, along with methods of its solution. The method is demonstrated on a large dataset of somatic mutations spanning 5000+ patients, each having one of 29 cancer types. Applied to these data, our method dramatically reduces the number of gene locations required for broad coverage of patients and their mutations, giving subject specialists a more easily interpretable snapshot of recurrent mutational profiles in these cancers. The locations identified coincide with previously identified cancer genes. Finally, despite considerable data reduction, we show that our covering designs preserve the cancer discrimination ability of multinomial logistic regression models trained on all of the locations (> 1M).
View details for DOI 10.1080/02664763.2018.1432577
View details for PubMedID 30294060
View details for PubMedCentralID PMC6173524
-
CFH and ARMS2 genetic risk determines progression to neovascular age-related macular degeneration after antioxidant and zinc supplementation
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2018; 115 (4): E696–E704
Abstract
We evaluated the influence of an antioxidant and zinc nutritional supplement [the Age-Related Eye Disease Study (AREDS) formulation] on delaying or preventing progression to neovascular AMD (NV) in persons with age-related macular degeneration (AMD). AREDS subjects (n = 802) with category 3 or 4 AMD at baseline who had been treated with placebo or the AREDS formulation were evaluated for differences in the risk of progression to NV as a function of complement factor H (CFH) and age-related maculopathy susceptibility 2 (ARMS2) genotype groups. We used published genetic grouping: a two-SNP haplotype risk-calling algorithm to assess CFH, and either the single SNP rs10490924 or 372_815del443ins54 to mark ARMS2 risk. Progression risk was determined using the Cox proportional hazard model. Genetics-treatment interaction on NV risk was assessed using a multiiterative bootstrap validation analysis. We identified strong interaction of genetics with AREDS formulation treatment on the development of NV. Individuals with high CFH and no ARMS2 risk alleles and taking the AREDS formulation had increased progression to NV compared with placebo. Those with low CFH risk and high ARMS2 risk had decreased progression risk. Analysis of CFH and ARMS2 genotype groups from a validation dataset reinforces this conclusion. Bootstrapping analysis confirms the presence of a genetics-treatment interaction and suggests that individual treatment response to the AREDS formulation is largely determined by genetics. The AREDS formulation modifies the risk of progression to NV based on individual genetics. Its use should be based on patient-specific genotype.
View details for PubMedID 29311295
-
DRUG-NEM: Optimizing drug combinations using single-cell perturbation response to account for intratumoral heterogeneity.
Proceedings of the National Academy of Sciences of the United States of America
2018; 115 (18): E4294–E4303
Abstract
An individual malignant tumor is composed of a heterogeneous collection of single cells with distinct molecular and phenotypic features, a phenomenon termed intratumoral heterogeneity. Intratumoral heterogeneity poses challenges for cancer treatment, motivating the need for combination therapies. Single-cell technologies are now available to guide effective drug combinations by accounting for intratumoral heterogeneity through the analysis of the signaling perturbations of an individual tumor sample screened by a drug panel. In particular, Mass Cytometry Time-of-Flight (CyTOF) is a high-throughput single-cell technology that enables the simultaneous measurements of multiple ([Formula: see text]40) intracellular and surface markers at the level of single cells for hundreds of thousands of cells in a sample. We developed a computational framework, entitled Drug Nested Effects Models (DRUG-NEM), to analyze CyTOF single-drug perturbation data for the purpose of individualizing drug combinations. DRUG-NEM optimizes drug combinations by choosing the minimum number of drugs that produce the maximal desired intracellular effects based on nested effects modeling. We demonstrate the performance of DRUG-NEM using single-cell drug perturbation data from tumor cell lines and primary leukemia samples.
View details for PubMedID 29654148
-
Distinguishing malignant from benign microscopic skin lesions using desorption electrospray ionization mass spectrometry imaging.
Proceedings of the National Academy of Sciences of the United States of America
2018
Abstract
Detection of microscopic skin lesions presents a considerable challenge in diagnosing early-stage malignancies as well as in residual tumor interrogation after surgical intervention. In this study, we established the capability of desorption electrospray ionization mass spectrometry imaging (DESI-MSI) to distinguish between micrometer-sized tumor aggregates of basal cell carcinoma (BCC), a common skin cancer, and normal human skin. We analyzed 86 human specimens collected during Mohs micrographic surgery for BCC to cross-examine spatial distributions of numerous lipids and metabolites in BCC aggregates versus adjacent skin. Statistical analysis using the least absolute shrinkage and selection operator (Lasso) was employed to categorize each 200-µm-diameter picture element (pixel) of investigated skin tissue map as BCC or normal. Lasso identified 24 molecular ion signals, which are significant for pixel classification. These ion signals included lipids observed at m/z 200-1,200 and Krebs cycle metabolites observed at m/z < 200. Based on these features, Lasso yielded an overall 94.1% diagnostic accuracy pixel by pixel of the skin map compared with histopathological evaluation. We suggest that DESI-MSI/Lasso analysis can be employed as a complementary technique for delineation of microscopic skin tumors.
View details for PubMedID 29866838
-
A General Framework for Estimation and Inference From Clusters of Features
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2018; 113 (521): 280–93
View details for DOI 10.1080/01621459.2016.1246368
View details for Web of Science ID 000438960500030
-
Genomic feature selection by coverage design optimization
Journal of Applied Statistics
2018
View details for DOI 10.1080/02664763.2018.1432577
-
Food allergy and omics.
The Journal of allergy and clinical immunology
2018; 141 (1): 20–29
Abstract
Food allergy (FA) prevalence has been increasing over the last few decades and is now a global health concern. Current diagnostic methods for FA result in a high number of false-positive results, and the standard of care is either allergen avoidance or use of epinephrine on accidental exposure, with no other treatments currently approved. The increasing prevalence of FA, lack of robust biomarkers, and inadequate treatments warrant further research into the mechanism underlying food allergies. Recent technological advances have made it possible to move beyond traditional biological techniques to more sophisticated high-throughput approaches. These technologies have created the burgeoning field of omics sciences, which permit a more systematic investigation of biological problems. Omics sciences, such as genomics, epigenomics, transcriptomics, proteomics, metabolomics, microbiomics, and exposomics, have enabled the construction of regulatory networks and biological pathway models. Parallel advances in bioinformatics and computational techniques have enabled the integration, analysis, and interpretation of these exponentially growing data sets and open the possibility of personalized or precision medicine for FA.
View details for PubMedID 29307411
-
KLHL6 Is Preferentially Expressed in Germinal Center-Derived B-Cell Lymphomas
AMERICAN JOURNAL OF CLINICAL PATHOLOGY
2017; 148 (6): 465–76
Abstract
KLHL6 is a recently described BTB-Kelch protein with selective expression in lymphoid tissues and is most strongly expressed in germinal center B cells. Using gene expression profiling as well as immunohistochemistry with an anti-KLHL6 monoclonal antibody, we have characterized the expression of this molecule in normal and neoplastic tissues. Protein expression was evaluated in 1,058 hematopoietic neoplasms. Consistent with its discovery as a germinal center marker, KLHL6 was positive mainly in B-cell neoplasms of germinal center derivation, including 95% of follicular lymphomas (106/112). B-cell lymphomas of non-germinal center derivation were generally negative (0/33 chronic lymphocytic leukemias/small lymphocytic lymphomas, 3/49 marginal zone lymphomas, and 2/66 mantle cell lymphomas). In addition to other germinal center markers, including BCL6, CD10, HGAL, and LMO2, KLHL6 immunohistochemistry may prove a useful adjunct in the diagnosis and future classification of B-cell lymphomas.
View details for PubMedID 29140403
-
SELECTING THE NUMBER OF PRINCIPAL COMPONENTS: ESTIMATION OF THE TRUE RANK OF A NOISY MATRIX
ANNALS OF STATISTICS
2017; 45 (6): 2590–2617
View details for DOI 10.1214/16-AOS1536
View details for Web of Science ID 000418371600011
-
Big data modeling to predict platelet usage and minimize wastage in a tertiary care system
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2017; 114 (43): 11368–73
Abstract
Maintaining a robust blood product supply is an essential requirement to guarantee optimal patient care in modern health care systems. However, daily blood product use is difficult to anticipate. Platelet products are the most variable in daily usage, have short shelf lives, and are also the most expensive to produce, test, and store. Due to the combination of absolute need, uncertain daily demand, and short shelf life, platelet products are frequently wasted due to expiration. Our aim is to build and validate a statistical model to forecast future platelet demand and thereby reduce wastage. We have investigated platelet usage patterns at our institution, and specifically interrogated the relationship between platelet usage and aggregated hospital-wide patient data over a recent consecutive 29-mo period. Using a convex statistical formulation, we have found that platelet usage is highly dependent on weekday/weekend pattern, number of patients with various abnormal complete blood count measurements, and location-specific hospital census data. We incorporated these relationships in a mathematical model to guide collection and ordering strategy. This model minimizes waste due to expiration while avoiding shortages; the number of remaining platelet units at the end of any day stays above 10 in our model during the same period. Compared with historical expiration rates during the same period, our model reduces the expiration rate from 10.5 to 3.2%. Extrapolating our results to the ∼2 million units of platelets transfused annually within the United States, if implemented successfully, our model can potentially save ∼80 million dollars in health care costs.
View details for PubMedID 29073058
-
Post-selection point and interval estimation of signal sizes in Gaussian samples
CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE
2017; 45 (2): 128-148
View details for DOI 10.1002/cjs.11320
View details for Web of Science ID 000400027400001
-
Metabolic Markers and Statistical Prediction of Serous Ovarian Cancer Aggressiveness by Ambient Ionization Mass Spectrometry Imaging.
Cancer research
2017; 77 (11): 2903-2913
Abstract
Ovarian high-grade serous carcinoma (HGSC) results in the highest mortality among gynecological cancers, developing rapidly and aggressively. Dissimilarly, serous borderline ovarian tumors (BOT) can progress into low-grade serous carcinomas and have relatively indolent clinical behavior. The underlying biological differences between HGSC and BOT call for accurate diagnostic methodologies and tailored treatment options, and identification of molecular markers of aggressiveness could provide valuable biochemical insights and improve disease management. Here, we used desorption electrospray ionization (DESI) mass spectrometry (MS) to image and chemically characterize the metabolic profiles of HGSC, BOT, and normal ovarian tissue samples. DESI-MS imaging enabled clear visualization of fine papillary branches in serous BOT and allowed for characterization of spatial features of tumor heterogeneity such as adjacent necrosis and stroma in HGSC. Predictive markers of cancer aggressiveness were identified, including various free fatty acids, metabolites, and complex lipids such as ceramides, glycerophosphoglycerols, cardiolipins, and glycerophosphocholines. Classification models built from a total of 89,826 individual pixels, acquired in positive and negative ion modes from 78 different tissue samples, enabled diagnosis and prediction of HGSC and all tumor samples in comparison with normal tissues, with overall agreements of 96.4% and 96.2%, respectively. HGSC and BOT discrimination was achieved with an overall accuracy of 93.0%. Interestingly, our classification model allowed identification of three BOT samples presenting unusual histologic features that could be associated with the development of low-grade carcinomas. Our results suggest DESI-MS as a powerful approach for rapid serous ovarian cancer diagnosis based on altered metabolic signatures. Cancer Res; 77(11); 2903-13. ©2017 AACR.
View details for DOI 10.1158/0008-5472.CAN-16-3044
View details for PubMedID 28416487
-
Chemical Space Mimicry for Drug Discovery
JOURNAL OF CHEMICAL INFORMATION AND MODELING
2017; 57 (4): 875-882
Abstract
We describe a new library generation method, Machine-based Identification of Molecules Inside Characterized Space (MIMICS), that generates sets of molecules inspired by a text-based input. MIMICS-generated libraries were found to preserve distributions of properties while simultaneously increasing structural diversity. Newly identified MIMICS-generated compounds were found to be bioactive as inhibitors of specific components of the unfolded protein response (UPR) and the VEGFR2 pathway in cell-based assays, thus confirming the applicability of this methodology toward drug design applications. Wider application of MIMICS could facilitate the efficient utilization of chemical space.
View details for DOI 10.1021/acs.jcim.6b00754
View details for Web of Science ID 000400204900023
View details for PubMedID 28257191
-
Diagnosis of prostate cancer by desorption electrospray ionization mass spectrometric imaging of small metabolites and lipids
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2017; 114 (13): 3334-3339
Abstract
Accurate identification of prostate cancer in frozen sections at the time of surgery can be challenging, limiting the surgeon's ability to best determine resection margins during prostatectomy. We performed desorption electrospray ionization mass spectrometry imaging (DESI-MSI) on 54 banked human cancerous and normal prostate tissue specimens to investigate the spatial distribution of a wide variety of small metabolites, carbohydrates, and lipids. In contrast to several previous studies, our method included Krebs cycle intermediates (m/z <200), which we found to be highly informative in distinguishing cancer from benign tissue. Malignant prostate cells showed marked metabolic derangements compared with their benign counterparts. Using the "Least absolute shrinkage and selection operator" (Lasso), we analyzed all metabolites from the DESI-MS data and identified parsimonious sets of metabolic profiles for distinguishing between cancer and normal tissue. In an independent set of samples, we could use these models to classify prostate cancer from benign specimens with nearly 90% accuracy per patient. Based on previous work in prostate cancer showing that glucose levels are high while citrate is low, we found that measurement of the glucose/citrate ion signal ratio accurately predicted cancer when this ratio exceeds 1.0 and normal prostate when the ratio is less than 0.5. After brief tissue preparation, the glucose/citrate ratio can be recorded on a tissue sample in 1 min or less, which is in sharp contrast to the 20 min or more required by histopathological examination of frozen tissue specimens.
View details for DOI 10.1073/pnas.1700677114
View details for Web of Science ID 000397607300049
View details for PubMedID 28292895
View details for PubMedCentralID PMC5380053
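The glucose/citrate decision rule described in the abstract above can be sketched as a small function. This is only an illustration, not the authors' code: the function name, the string labels, and the "indeterminate" category for ratios between 0.5 and 1.0 are assumptions, since the abstract states only the two thresholds.

```python
def classify_glucose_citrate(glucose_signal, citrate_signal):
    """Apply the two thresholds quoted in the abstract to an ion signal ratio.

    The 'indeterminate' label for ratios in [0.5, 1.0] is an assumption;
    the abstract does not say how that range is handled.
    """
    if citrate_signal <= 0:
        raise ValueError("citrate signal must be positive")
    ratio = glucose_signal / citrate_signal
    if ratio > 1.0:      # ratio exceeds 1.0: predicted cancer
        return "cancer"
    if ratio < 0.5:      # ratio below 0.5: predicted normal prostate
        return "normal"
    return "indeterminate"


print(classify_glucose_citrate(2.0, 1.0))
print(classify_glucose_citrate(0.4, 1.0))
```

A single-ratio rule like this is attractive for intraoperative use because it needs no fitted model at decision time, only two measured ion signals.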
-
Landscape of monoallelic DNA accessibility in mouse embryonic stem cells and neural progenitor cells.
Nature genetics
2017; 49 (3): 377-386
Abstract
We developed an allele-specific assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) to genotype and profile active regulatory DNA across the genome. Using a mouse hybrid F1 system, we found that monoallelic DNA accessibility across autosomes was pervasive, developmentally programmed and composed of several patterns. Genetically determined accessibility was enriched at distal enhancers, but random monoallelically accessible (RAMA) elements were enriched at promoters and may act as gatekeepers of monoallelic mRNA expression. Allelic choice at RAMA elements was stable across cell generations and bookmarked through mitosis. RAMA elements in neural progenitor cells were biallelically accessible in embryonic stem cells but premarked with bivalent histone modifications; one allele was silenced during differentiation. Quantitative analysis indicated that allelic choice at the majority of RAMA elements is consistent with a stochastic process; however, up to 30% of RAMA elements may deviate from the expected pattern, suggesting a regulated or counting mechanism.
View details for DOI 10.1038/ng.3769
View details for PubMedID 28112738
-
Long-term course of patients with primary ocular adnexal MALT lymphoma: a large single-institution cohort study
BLOOD
2017; 129 (3): 324-332
Abstract
While Primary Ocular Adnexal MALT Lymphoma (POAML) is the most common orbital tumor, there are large gaps in knowledge of its natural history. We conducted a retrospective analysis of the largest reported cohort, consisting of 182 patients with POAML, diagnosed or treated at our institution to analyze long-term outcome, response to treatment, incidence and localization of relapse and transformation. The majority of patients (80%) presented with stage I disease. Overall, 84% of treated patients achieved a complete response after first-line therapy. In patients with stage I disease treated with radiation therapy (RT), doses ≥ 30.6Gy were associated with significantly better complete response rate (p=0.04) and progression free survival (PFS) at 5 and 10 years (p<0.0001). Median overall survival and PFS for all patients were 250 months (95% CI: 222 - upper limit not reached) and 134 months (95% CI: 87 - 198), respectively. Kaplan-Meier estimates for the PFS at 1, 5, and 10 years were 91.5% (95% CI: 86.1% - 94.9%), 68.5% (95% CI: 60.4% - 75.6%), and 50.9% (95% CI: 40.5% - 61.6%), respectively. In univariate analysis, age > 60 years, radiation dose, bilateral ocular involvement at presentation and advanced stage were significantly correlated with shorter PFS (p=0.006, p=0.0001, p=0.002 and p=0.0001, respectively). Multivariate analysis showed that age >60 years (HR= 2.44) and RT<30.6Gy (HR=4.17) were the only factors correlated with shorter PFS (p=0.01 and p=0.0003, respectively). We demonstrate that POAMLs harbor a persistent and ongoing risk for relapses, including in the central nervous system, and transformation to aggressive lymphoma (4%), requiring long-term follow-up.
View details for DOI 10.1182/blood-2016-05-714584
View details for Web of Science ID 000396529800010
-
An immune clock of human pregnancy.
Science immunology
2017; 2 (15)
Abstract
The maintenance of pregnancy relies on finely tuned immune adaptations. We demonstrate that these adaptations are precisely timed, reflecting an immune clock of pregnancy in women delivering at term. Using mass cytometry, the abundance and functional responses of all major immune cell subsets were quantified in serial blood samples collected throughout pregnancy. Cell signaling-based Elastic Net, a regularized regression method adapted from the elastic net algorithm, was developed to infer and prospectively validate a predictive model of interrelated immune events that accurately captures the chronology of pregnancy. Model components highlighted existing knowledge and revealed previously unreported biology, including a critical role for the interleukin-2-dependent STAT5ab signaling pathway in modulating T cell function during pregnancy. These findings unravel the precise timing of immunological events occurring during a term pregnancy and provide the analytical framework to identify immunological deviations implicated in pregnancy-related pathologies.
View details for PubMedID 28864494
-
A simple method for analyzing matched designs with double controls: McNemar's test can be extended.
Journal of clinical epidemiology
2017; 81: 51-55.e2
Abstract
To introduce a new analytic approach for matched studies, where exactly two controls are linked to each case (double controls rather than solitary controls). The intent is to extend McNemar's test for one-to-two matching (instead of one-to-one matching) when evaluating binary predictors and outcomes. We review McNemar's approach for analyzing matched data, demonstrate the Mantel-Haenszel approach for integrating two overlapping McNemar's estimates, review conditional logistic regression as an alternative analytic approach, and introduce a new method that yields a visual display and easy verification. We illustrate the new approach with real data testing the association between overcast weather and the risk of a life-threatening traffic crash (n = 6,962). We show that results from the new approach agree closely with conditional logistic regression and are sufficiently simple as to be computed on a handheld calculator. We further validate the approach by conducting simulations when a positive association was predefined and when a null association was predefined. The new approach provides a feasible, simple, and efficient method for analyzing matched designs with double controls.
View details for DOI 10.1016/j.jclinepi.2016.08.006
View details for PubMedID 27565976
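For orientation, the classical one-to-one McNemar's test that this paper extends can be sketched as follows. This is the standard textbook statistic (without continuity correction), not the authors' double-control method; the function name and the example counts are hypothetical.

```python
import math


def mcnemar_test(b, c):
    """Classical McNemar's test for one-to-one matched pairs.

    b: discordant pairs where only the case is exposed
    c: discordant pairs where only the control is exposed
    Returns the chi-square statistic (1 degree of freedom) and its p-value.
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    n = b + c
    if n == 0:
        raise ValueError("at least one discordant pair is required")
    statistic = (b - c) ** 2 / n
    # For a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


stat, p = mcnemar_test(15, 5)  # hypothetical discordant-pair counts
print(stat, p)
```

Only the discordant pairs enter the statistic, which is why the calculation is simple enough to do by hand, a property the abstract says the extended method retains.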
-
An Ordered Lasso and Sparse Time-Lagged Regression
TECHNOMETRICS
2016; 58 (4): 415-423
View details for DOI 10.1080/00401706.2015.1079245
View details for Web of Science ID 000386209500002
-
Long term course of patients with primary ocular adnexal malt lymphoma: a large single institution cohort study.
Blood
2016
View details for PubMedID 27789481
-
High-dimensional regression adjustments in randomized experiments.
Proceedings of the National Academy of Sciences of the United States of America
2016
Abstract
We study the problem of treatment effect estimation in randomized experiments with high-dimensional covariate information and show that essentially any risk-consistent regression adjustment can be used to obtain efficient estimates of the average treatment effect. Our results considerably extend the range of settings where high-dimensional regression adjustments are guaranteed to provide valid inference about the population average treatment effect. We then propose cross-estimation, a simple method for obtaining finite-sample-unbiased treatment effect estimates that leverages high-dimensional regression adjustments. Our method can be used when the regression model is estimated using the lasso, the elastic net, subset selection, etc. Finally, we extend our analysis to allow for adaptive specification search via cross-validation and flexible nonparametric regression adjustments with machine-learning methods such as random forests or neural networks.
View details for PubMedID 27791165
-
An Ordered Lasso and Sparse Time-Lagged Regression.
Technometrics : a journal of statistics for the physical, chemical, and engineering sciences
2016; 58 (4): 415-423
Abstract
We consider regression scenarios where it is natural to impose an order constraint on the coefficients. We propose an order-constrained version of ℓ1-regularized regression (Lasso) for this problem, and show how to solve it efficiently using the well-known Pool Adjacent Violators Algorithm as its proximal operator. The main application of this idea is to time-lagged regression, where we predict an outcome at time t from features at the previous K time points. In this setting it is natural to assume that the coefficients decay as we move farther away from t, and hence the order constraint is reasonable. Potential application areas include financial time series and prediction of dynamic patient outcomes based on clinical measurements. We illustrate this idea on real and simulated data.
View details for DOI 10.1080/00401706.2015.1079245
View details for PubMedID 36909149
View details for PubMedCentralID PMC10004099
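The Pool Adjacent Violators Algorithm mentioned in the abstract above can be sketched in a few lines. This is a minimal, unweighted version that computes the least-squares non-decreasing fit to a sequence; it is not the full ordered lasso solver from the paper, and the function name is hypothetical.

```python
def pava_nondecreasing(y):
    """Least-squares fit to y under the constraint fit[0] <= fit[1] <= ...

    Adjacent values that violate the ordering are pooled into blocks
    and replaced by their block mean.
    """
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit


print(pava_nondecreasing([1, 3, 2, 4]))  # [1.0, 2.5, 2.5, 4.0]
```

A non-increasing fit, which matches the decaying-coefficient assumption for time-lagged regression, can be obtained by reversing the input, applying the function, and reversing the result.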
-
Cardiolipins Are Biomarkers of Mitochondria-Rich Thyroid Oncocytic Tumors.
Cancer research
2016
Abstract
Oncocytic tumors are characterized by an excessive eosinophilic, granular cytoplasm due to aberrant accumulation of mitochondria. Mutations in mitochondrial DNA occur in oncocytic thyroid tumors, but there is no information about their lipid composition, which might reveal candidate theranostic molecules. Here, we used desorption electrospray ionization mass spectrometry (DESI-MS) to image and chemically characterize the lipid composition of oncocytic thyroid tumors, as compared with nononcocytic thyroid tumors and normal thyroid samples. We identified a novel molecular signature of oncocytic tumors characterized by an abnormally high abundance and chemical diversity of cardiolipins (CL), including many oxidized species. DESI-MS imaging and IHC experiments confirmed that the spatial distribution of CLs overlapped with regions of accumulation of mitochondria-rich oncocytic cells. Fluorescent imaging and mitochondrial isolation showed that both mitochondrial accumulation and alteration in CL composition of mitochondria occurred in oncocytic tumor cells, thus contributing to the aberrant molecular signatures detected. A total of 219 molecular ions, including CLs, other glycerophospholipids, fatty acids, and metabolites, were found at increased or decreased abundance in oncocytic, nononcocytic, or normal thyroid tissues. Our findings suggest new candidate targets for clinical and therapeutic use against oncocytic tumors. Cancer Res; 76(22); 1-10. ©2016 AACR.
View details for PubMedID 27659048
-
Data Shared Lasso: A novel tool to discover uplift
COMPUTATIONAL STATISTICS & DATA ANALYSIS
2016; 101: 226-235
View details for DOI 10.1016/j.csda.2016.02.015
View details for Web of Science ID 000378444200017
-
Data Shared Lasso: A Novel Tool to Discover Uplift.
Computational statistics & data analysis
2016; 101: 226-235
Abstract
A model is presented for the supervised learning problem where the observations come from a fixed number of pre-specified groups, and the regression coefficients may vary sparsely between groups. The model spans the continuum between individual models for each group and one model for all groups. The resulting algorithm is designed with a high dimensional framework in mind. The approach is applied to a sentiment analysis dataset to show its efficacy and interpretability. One particularly useful application is for finding sub-populations in a randomized trial for which an intervention (treatment) is beneficial, often called the uplift problem. Some new concepts are introduced that are useful for uplift analysis. The value is demonstrated in an application to a real world credit card promotion dataset. In this example, although sending the promotion has a very small average effect, by targeting a particular subgroup with the promotion one can obtain a 15% increase in the proportion of people who purchase the new credit card.
View details for DOI 10.1016/j.csda.2016.02.015
View details for PubMedID 29056802
View details for PubMedCentralID PMC5650251
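One way to picture a model that sits between a single pooled fit and separate per-group fits is an augmented design matrix: each observation keeps its shared feature columns and also gets a copy of them in a block reserved for its group, so that an ℓ1-penalized fit can keep the group-specific coefficients sparse. The sketch below builds only that matrix. It is an assumption about how such a model could be set up, not a statement of the paper's exact formulation; the function name is hypothetical and any relative scaling of the group blocks is omitted.

```python
def augment_design(X, groups, n_groups):
    """Build rows of the form [x, 0, ..., x (in the block of the row's group), ..., 0].

    X: list of feature rows (lists of floats)
    groups: group index in range(n_groups) for each row
    """
    rows = []
    for x, g in zip(X, groups):
        if not 0 <= g < n_groups:
            raise ValueError("group index out of range")
        p = len(x)
        row = list(x) + [0.0] * (p * n_groups)  # shared block, then one block per group
        start = p * (1 + g)
        row[start:start + p] = x                # copy features into the group's block
        rows.append(row)
    return rows


Z = augment_design([[1.0, 2.0], [3.0, 4.0]], groups=[0, 1], n_groups=2)
print(Z)
```

With this layout, coefficients on the first block act on every observation, while coefficients on a group block adjust the fit for that group only; shrinking all group blocks to zero recovers one model for all groups.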
-
Pancreatic Cancer Surgical Resection Margins: Molecular Assessment by Mass Spectrometry Imaging.
PLoS medicine
2016; 13 (8)
Abstract
Surgical resection with microscopically negative margins remains the main curative option for pancreatic cancer; however, in practice intraoperative delineation of resection margins is challenging. Ambient mass spectrometry imaging has emerged as a powerful technique for chemical imaging and real-time diagnosis of tissue samples. We applied an approach combining desorption electrospray ionization mass spectrometry imaging (DESI-MSI) with the least absolute shrinkage and selection operator (Lasso) statistical method to diagnose pancreatic tissue sections and prospectively evaluate surgical resection margins from pancreatic cancer surgery. Our methodology was developed and tested using 63 banked pancreatic cancer samples and 65 samples (tumor and specimen margins) collected prospectively during 32 pancreatectomies from February 27, 2013, to January 16, 2015. In total, mass spectra for 254,235 individual pixels were evaluated. When cross-validation was employed in the training set of samples, 98.1% agreement with histopathology was obtained. Using an independent set of samples, 98.6% agreement was achieved. We used a statistical approach to evaluate 177,727 mass spectra from samples with complex, mixed histology, achieving an agreement of 81%. The developed method showed agreement with frozen section evaluation of specimen margins in 24 of 32 surgical cases prospectively evaluated. In the remaining eight patients, margins were found to be positive by DESI-MSI/Lasso, but negative by frozen section analysis. The median overall survival after resection was only 10 mo for these eight patients as opposed to 26 mo for patients with negative margins by both techniques. This observation suggests that our method (as opposed to the standard method to date) was able to detect tumor involvement at the margin in patients who developed early recurrence. Nonetheless, a larger cohort of samples is needed to validate the findings described in this study. Careful evaluation of the long-term benefits to patients of the use of DESI-MSI for surgical margin evaluation is also needed to determine its value in clinical practice. Our findings provide evidence that the molecular information obtained by DESI-MSI/Lasso from pancreatic tissue samples has the potential to transform the evaluation of surgical specimens. With further development, we believe the described methodology could be routinely used for intraoperative surgical margin assessment of pancreatic cancer.
View details for DOI 10.1371/journal.pmed.1002108
View details for PubMedID 27575375
-
Pathophysiological significance and therapeutic targeting of germinal center kinase in diffuse large B-cell lymphoma.
Blood
2016; 128 (2): 239-248
Abstract
Diffuse large B-cell lymphoma (DLBCL) is the most common subtype of non-Hodgkin lymphoma (NHL), yet 40-50% of patients will eventually succumb to their disease, demonstrating a pressing need for novel therapeutic options. Gene expression profiling has identified messenger RNAs that lead to transformation, but critical events transforming cells are normally executed by kinases. Therefore, we hypothesized that previously unrecognized kinases may contribute to DLBCL pathogenesis. We performed the first comprehensive analysis of global kinase activity in DLBCL, to identify novel therapeutic targets, and discovered that Germinal Center Kinase (GCK) was extensively activated. GCK RNA interference and small molecule inhibition induced cell cycle arrest and apoptosis in DLBCL cell lines and primary tumors in vitro and decreased the tumor growth rate in vivo, resulting in a significantly extended lifespan of mice bearing DLBCL xenografts. GCK expression was also linked to adverse clinical outcome in a cohort of 151 primary DLBCL patients. These studies demonstrate, for the first time, that GCK is a molecular therapeutic target in DLBCL tumors and that inhibiting GCK may significantly extend DLBCL patient survival. Since the majority of DLBCL tumors (~80%) exhibit activation of GCK, this therapy may be applicable to most patients.
View details for DOI 10.1182/blood-2016-02-696856
View details for PubMedID 27151888
-
Exact Post-Selection Inference for Sequential Regression Procedures
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2016; 111 (514): 600-614
View details for DOI 10.1080/01621459.2015.1108848
View details for Web of Science ID 000381326700012
-
INFERENCE IN ADAPTIVE REGRESSION VIA THE KAC-RICE FORMULA
ANNALS OF STATISTICS
2016; 44 (2): 743-770
View details for DOI 10.1214/15-AOS1386
View details for Web of Science ID 000372594300011
-
Sparse regression and marginal testing using cluster prototypes.
Biostatistics
2016; 17 (2): 364-376
Abstract
We propose a new approach for sparse regression and marginal testing, for data with correlated features. Our procedure first clusters the features, and then chooses as the cluster prototype the most informative feature in that cluster. Then we apply either sparse regression (lasso) or marginal significance testing to these prototypes. While this kind of strategy is not entirely new, a key feature of our proposal is its use of the post-selection inference theory of Taylor and others (2014, Exact post-selection inference for forward stepwise and least angle regression, Preprint, arXiv:1401.3889) and Lee and others (2014, Exact post-selection inference with the lasso, Preprint, arXiv:1311.6238v5) to compute exact p-values and confidence intervals that properly account for the selection of prototypes. We also apply the recent "knockoff" idea of Barber and Candès (2014, Controlling the false discovery rate via knockoffs, Preprint, arXiv:1404.5609) to provide exact finite sample control of the FDR of our regression procedure. We illustrate our proposals on both real and simulated data.
View details for DOI 10.1093/biostatistics/kxv049
View details for PubMedID 26614384
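The prototype-selection step described in the abstract above can be sketched as follows, given cluster labels that were computed beforehand. Measuring "most informative" by absolute Pearson correlation with the response is an assumption made for this illustration, and the function names are hypothetical. The subsequent lasso fit and the post-selection inference are not shown.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0 or sd_v == 0:
        return 0.0
    return cov / (sd_u * sd_v)


def select_prototypes(columns, clusters, y):
    """Return {cluster label: index of the feature with the largest |correlation| with y}.

    columns: list of feature columns; clusters: one cluster label per feature.
    """
    best = {}
    for j, (col, label) in enumerate(zip(columns, clusters)):
        score = abs(pearson(col, y))
        if label not in best or score > best[label][1]:
            best[label] = (j, score)
    return {label: j for label, (j, _) in best.items()}


y = [1, 2, 3, 4]
columns = [[1, 2, 3, 4], [1, 3, 2, 4], [4, 1, 3, 2], [4, 3, 2, 1]]
print(select_prototypes(columns, clusters=[0, 0, 1, 1], y=y))
```

Because the prototypes are chosen by looking at the response, naive p-values computed afterwards would be too optimistic, which is the problem the post-selection inference in the paper is meant to address.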
-
Successful immunotherapy induces previously unidentified allergen-specific CD4+ T-cell subsets.
Proceedings of the National Academy of Sciences of the United States of America
2016; 113 (9): E1286-95
Abstract
Allergen immunotherapy can desensitize even subjects with potentially lethal allergies, but the changes induced in T cells that underpin successful immunotherapy remain poorly understood. In a cohort of peanut-allergic participants, we used allergen-specific T-cell sorting and single-cell gene expression to trace the transcriptional "roadmap" of individual CD4+ T cells throughout immunotherapy. We found that successful immunotherapy induces allergen-specific CD4+ T cells to expand and shift toward an "anergic" Th2 T-cell phenotype largely absent in both pretreatment participants and healthy controls. These findings show that sustained success, even after immunotherapy is withdrawn, is associated with the induction, expansion, and maintenance of immunotherapy-specific memory and naive T-cell phenotypes as early as 3 mo into immunotherapy. These results suggest an approach for immune monitoring participants undergoing immunotherapy to predict the success of future treatment and could have implications for immunotherapy targets in other diseases like cancer, autoimmune disease, and transplantation.
View details for DOI 10.1073/pnas.1520180113
View details for PubMedID 26811452
-
Sequential selection procedures and false discovery rate control
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2016; 78 (2): 423-444
View details for DOI 10.1111/rssb.12122
View details for Web of Science ID 000369136600005
-
A STUDY OF ERROR VARIANCE ESTIMATION IN LASSO REGRESSION
STATISTICA SINICA
2016; 26 (1): 35-67
View details for DOI 10.5705/ss.2014.042
View details for Web of Science ID 000368972400002
-
CUSTOMIZED TRAINING WITH AN APPLICATION TO MASS SPECTROMETRIC IMAGING OF CANCER TISSUE
ANNALS OF APPLIED STATISTICS
2015; 9 (4): 1709-1725
View details for DOI 10.1214/15-AOAS866
View details for Web of Science ID 000370445600001
-
CUSTOMIZED TRAINING WITH AN APPLICATION TO MASS SPECTROMETRIC IMAGING OF CANCER TISSUE.
The annals of applied statistics
2015; 9 (4): 1709-1725
Abstract
We introduce a simple, interpretable strategy for making predictions on test data when the features of the test data are available at the time of model fitting. Our proposal, customized training, clusters the data to find training points close to each test point and then fits an ℓ1-regularized model (lasso) separately in each training cluster. This approach combines the local adaptivity of k-nearest neighbors with the interpretability of the lasso. Although we use the lasso for the model fitting, any supervised learning method can be applied to the customized training sets. We apply the method to a mass-spectrometric imaging data set from an ongoing collaboration in gastric cancer detection which demonstrates the power and interpretability of the technique. Our idea is simple but potentially useful in situations where the data have some underlying structure.
View details for DOI 10.1214/15-AOAS866
View details for PubMedID 30370000
View details for PubMedCentralID PMC6200412
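The customized-training idea in the abstract above can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not taken from the paper: a single feature, a two-cluster k-means as the clustering step, and a closed-form one-feature lasso. All function names (`two_means_1d`, `fit_lasso_1d`, `customized_predict`) and the fallback for empty clusters are hypothetical.

```python
import random

def soft_threshold(z, lam):
    """Soft-thresholding operator that produces the lasso shrinkage."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def fit_lasso_1d(xs, ys, lam):
    """One-feature lasso with an unpenalized intercept (closed form).

    Minimizes (1/2n) * sum((y - b0 - b*x)^2) + lam * |b|.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var = sum((x - mx) ** 2 for x in xs) / n
    slope = soft_threshold(cov, lam) / var if var > 0 else 0.0
    return my - slope * mx, slope

def nearest(p, centers):
    """Index (0 or 1) of the closer of two cluster centers."""
    return 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1

def two_means_1d(points, iters=20):
    """Plain two-cluster k-means on scalars (assumed stand-in for the clustering step)."""
    centers = [min(points), max(points)]
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            groups[nearest(p, centers)].append(p)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def customized_predict(train_x, train_y, test_x, lam):
    """Cluster training and test features together, then fit one lasso per cluster."""
    centers = two_means_1d(list(train_x) + list(test_x))
    models = {}
    for k in {nearest(t, centers) for t in test_x}:
        pairs = [(x, y) for x, y in zip(train_x, train_y) if nearest(x, centers) == k]
        if not pairs:  # hypothetical fallback: use all training data
            pairs = list(zip(train_x, train_y))
        models[k] = fit_lasso_1d([x for x, _ in pairs], [y for _, y in pairs], lam)
    preds = []
    for t in test_x:
        intercept, slope = models[nearest(t, centers)]
        preds.append(intercept + slope * t)
    return preds

# Toy data: two groups with different linear relationships.
rng = random.Random(0)
train_x = [rng.gauss(0, 1) for _ in range(50)] + [rng.gauss(10, 1) for _ in range(50)]
train_y = ([2 * x + rng.gauss(0, 0.1) for x in train_x[:50]]
           + [40 - 3 * x + rng.gauss(0, 0.1) for x in train_x[50:]])
preds = customized_predict(train_x, train_y, [0.5, 9.5], lam=0.01)
print(preds)
```

A single global linear fit would blend the two slopes; fitting within each cluster lets each test point borrow only from nearby training points, which is the local adaptivity the abstract refers to.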
-
A Permutation Approach to Testing Interactions for Binary Response by Comparing Correlations Between Classes
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2015; 110 (512): 1707-1716
View details for DOI 10.1080/01621459.2014.993079
View details for Web of Science ID 000368797700041
-
A component lasso
CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE
2015; 43 (4): 624-646
View details for DOI 10.1002/cjs.11267
View details for Web of Science ID 000367667700008
-
The Radiogenomic Risk Score: Construction of a Prognostic Quantitative, Noninvasive Image-based Molecular Assay for Renal Cell Carcinoma
RADIOLOGY
2015; 277 (1): 114-123
Abstract
Purpose To evaluate the feasibility of constructing radiogenomic-based surrogates of molecular assays (SOMAs) in patients with clear-cell renal cell carcinoma (CCRCC) by using data extracted from a single computed tomographic (CT) image. Materials and Methods In this institutional review board approved study, gene expression profile data and contrast material-enhanced CT images from 70 patients with CCRCC in a training set were independently assessed by two radiologists for a set of predefined imaging features. A SOMA for a previously validated CCRCC-specific supervised principal component (SPC) risk score prognostic gene signature was constructed and termed the radiogenomic risk score (RRS). It uses the microarray data and a 28-trait image array to evaluate each CT image with multiple regression of gene expression analysis. The predictive power of the RRS SOMA was then prospectively validated in an independent dataset to confirm its relationship to the SPC gene signature (n = 70) and determination of patient outcome (n = 77). Data were analyzed by using multivariate linear regression-based methods and Cox regression modeling, and significance was assessed with receiver operator characteristic curves and Kaplan-Meier survival analysis. Results Our SOMA faithfully represents the tissue-based molecular assay it models. The RRS scaled with the SPC gene signature (R = 0.57, P < .001, classification accuracy 70.1%, P < .001) and predicted disease-specific survival (log rank P < .001). Independent validation confirmed the relationship between the RRS and the SPC gene signature (R = 0.45, P < .001, classification accuracy 68.6%, P < .001) and disease-specific survival (log-rank P < .001) and that it was independent of stage, grade, and performance status (multivariate Cox model P < .05, log-rank P < .001). 
Conclusion A SOMA for the CCRCC-specific SPC prognostic gene signature that is predictive of disease-specific survival and independent of stage was constructed and validated, confirming that SOMA construction is feasible. (©) RSNA, 2015 Online supplemental material is available for this article. An earlier incorrect version of this article appeared online. This article was corrected on August 24, 2015.
View details for DOI 10.1148/radiol.2015150800
View details for Web of Science ID 000368434000014
View details for PubMedID 26402495
-
Fibromyalgia and the Risk of a Subsequent Motor Vehicle Crash.
The Journal of rheumatology
2015; 42 (8): 1502-10
Abstract
Motor vehicle crashes are a widespread contributor to mortality and morbidity, sometimes related to medically unfit motorists. We tested whether patients diagnosed with fibromyalgia (FM) have an increased risk of a subsequent serious motor vehicle crash. We conducted a population-based self-matched longitudinal cohort analysis to estimate the incidence rate ratio of crashes among patients diagnosed with FM relative to the population norm in Ontario, Canada. We included adults diagnosed from April 1, 2006, to March 31, 2012, excluding individuals younger than 18 years, living outside Ontario, lacking valid identifiers, or having only a single visit for the diagnosis. The primary outcome was an emergency department visit as a driver involved in a motor vehicle crash. The patients (n = 137,631) accounted for 738 crashes during the first year of followup after diagnosis, equal to an incidence rate ratio of 2.44 compared with the population norm (95% CI 2.27-2.63, p < 0.001). The crash rate was more than twice the population norm for those with a new or a persistent diagnosis. The increased risk included patients with diverse characteristics, approached the rate observed among other patients diagnosed with alcoholism, and was mitigated among those who received dedicated FM care or a physician warning for driving safety. A diagnosis of FM is associated with an increased risk of a subsequent motor vehicle crash that might justify medical interventions for traffic safety.
View details for DOI 10.3899/jrheum.141315
View details for PubMedID 25979716
-
Statistical learning and selective inference
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2015; 112 (25): 7629-7634
Abstract
We describe the problem of "selective inference." This addresses the following challenge: Having mined a set of data to find potential associations, how do we properly assess the strength of these associations? The fact that we have "cherry-picked" (searched for the strongest associations) means that we must set a higher bar for declaring significant the associations that we see. This challenge becomes more important in the era of big data and complex statistical modeling. The cherry tree (dataset) can be very large and the tools for cherry picking (statistical learning methods) are now very sophisticated. We describe some recent new developments in selective inference and illustrate their use in forward stepwise regression, the lasso, and principal components analysis.
View details for DOI 10.1073/pnas.1507583112
View details for Web of Science ID 000356731300047
View details for PubMedID 26100887
View details for PubMedCentralID PMC4485109
-
Collaborative regression
BIOSTATISTICS
2015; 16 (2): 326-338
Abstract
We consider the scenario where one observes an outcome variable and sets of features from multiple assays, all measured on the same set of samples. One approach that has been proposed for dealing with this type of data is "sparse multiple canonical correlation analysis" (sparse mCCA). All of the current sparse mCCA techniques are biconvex and thus have no guarantees about reaching a global optimum. We propose a method for performing sparse supervised canonical correlation analysis (sparse sCCA), a specific case of sparse mCCA when one of the datasets is a vector. Our proposal for sparse sCCA is convex and thus does not face the same difficulties as the other methods. We derive efficient algorithms for this problem that can be implemented with off-the-shelf solvers, and illustrate their use on simulated and real data.
View details for DOI 10.1093/biostatistics/kxu047
View details for Web of Science ID 000354644900009
View details for PubMedID 25406332
View details for PubMedCentralID PMC4441100
-
CONVEX HIERARCHICAL TESTING OF INTERACTIONS
ANNALS OF APPLIED STATISTICS
2015; 9 (1): 27-42
View details for DOI 10.1214/14-AOAS758
View details for Web of Science ID 000358354400002
-
Molecular subtyping for clinically defined breast cancer subgroups
BREAST CANCER RESEARCH
2015; 17
Abstract
Breast cancer is commonly classified into intrinsic molecular subtypes. Standard gene centering is routinely done prior to molecular subtyping, but it can produce inaccurate classifications when the distribution of clinicopathological characteristics in the study cohort differs from that of the training cohort used to derive the classifier. We propose a subgroup-specific gene-centering method to perform molecular subtyping on a study cohort that has a skewed distribution of clinicopathological characteristics relative to the training cohort. On such a study cohort, we center each gene on a specified percentile, where the percentile is determined from a subgroup of the training cohort with clinicopathological characteristics similar to the study cohort. We demonstrate our method using the PAM50 classifier and its associated University of North Carolina (UNC) training cohort. We considered study cohorts with skewed clinicopathological characteristics, including subgroups composed of a single prototypic subtype of the UNC-PAM50 training cohort (n = 139), an external estrogen receptor (ER)-positive cohort (n = 48) and an external triple-negative cohort (n = 77). Subgroup-specific gene centering improved prediction performance, with accuracies between 77% and 100%, compared to accuracies between 17% and 33% from standard gene centering, when applied to the prototypic tumor subsets of the PAM50 training cohort. It reduced classification error rates on the ER-positive (11% versus 28%; P = 0.0389), the ER-negative (5% versus 41%; P < 0.0001) and the triple-negative (11% versus 56%; P = 0.1336) subgroups of the PAM50 training cohort. In addition, it produced higher accuracy for subtyping study cohorts composed of varying proportions of ER-positive versus ER-negative cases. Finally, it increased the percentage of assigned luminal subtypes on the external ER-positive cohort and basal-like subtype on the external triple-negative cohort. Gene centering is often necessary to accurately apply a molecular subtype classifier. Compared with standard gene centering, our proposed subgroup-specific gene centering produced more accurate molecular subtype assignments in a study cohort with skewed clinicopathological characteristics relative to the training cohort.
View details for DOI 10.1186/s13058-015-0520-4
View details for Web of Science ID 000351829500001
View details for PubMedID 25849221
View details for PubMedCentralID PMC4365540
-
Pancancer analysis of DNA methylation-driven genes using MethylMix
GENOME BIOLOGY
2015; 16
Abstract
Aberrant DNA methylation is an important mechanism that contributes to oncogenesis. Yet, few algorithms exist that exploit this vast dataset to identify hypo- and hypermethylated genes in cancer. We developed a novel computational algorithm called MethylMix to identify differentially methylated genes that are also predictive of transcription. We apply MethylMix to 12 individual cancer sites, and additionally combine all cancer sites in a pancancer analysis. We discover pancancer hypo- and hypermethylated genes and identify novel methylation-driven subgroups with clinical implications. MethylMix analysis on combined cancer sites reveals 10 pancancer clusters reflecting new similarities across malignantly transformed tissues.
View details for DOI 10.1186/s13059-014-0579-8
View details for Web of Science ID 000351817300001
View details for PubMedID 25631659
View details for PubMedCentralID PMC4365533
-
A Simple Method for Estimating Interactions Between a Treatment and a Large Number of Covariates
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2014; 109 (508): 1517-1532
Abstract
We consider a setting in which we have a treatment and a potentially large number of covariates for a set of observations, and wish to model their relationship with an outcome of interest. We propose a simple method for modeling interactions between the treatment and covariates. The idea is to modify the covariate in a simple way, and then fit a standard model using the modified covariates and no main effects. We show that coupled with an efficiency augmentation procedure, this method produces clinically meaningful estimators in a variety of settings. It can be useful for practicing personalized medicine: determining from a large set of biomarkers the subset of patients that can potentially benefit from a treatment. We apply the method to both simulated datasets and real trial data. The modified covariates idea can be used for other purposes, for example, large scale hypothesis testing for determining which of a set of covariates interact with a treatment variable.
View details for DOI 10.1080/01621459.2014.951443
View details for Web of Science ID 000346797000016
View details for PubMedID 25729117
View details for PubMedCentralID PMC4338439
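The modified-covariate idea described in this abstract can be illustrated with a small simulation. The sketch below assumes a particular form of the modification that the abstract does not spell out: the treatment T is coded as +1 or -1 with equal probability, and each covariate (plus a constant) is multiplied by T/2 before an ordinary least-squares fit with no intercept and no main effects. The function names and the simulated model are hypothetical, and the efficiency augmentation step is omitted.

```python
import random

def solve_2x2(a11, a12, a22, b1, b2):
    """Solve the symmetric system [[a11, a12], [a12, a22]] @ g = [b1, b2]."""
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

def fit_modified_covariates(xs, ts, ys):
    """Least squares of y on w = (t/2) * (1, x), with no intercept or main effects."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for x, t, y in zip(xs, ts, ys):
        w0, w1 = t / 2.0, t * x / 2.0  # modified constant and modified covariate
        a11 += w0 * w0
        a12 += w0 * w1
        a22 += w1 * w1
        b1 += w0 * y
        b2 += w1 * y
    return solve_2x2(a11, a12, a22, b1, b2)

def simulate(n, seed=0):
    """Randomized trial with main effect 2 + 3x and treatment effect 1 + 2x."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ts = [rng.choice((-1, 1)) for _ in range(n)]
    ys = [2.0 + 3.0 * x + (t / 2.0) * (1.0 + 2.0 * x) + rng.gauss(0, 1)
          for x, t in zip(xs, ts)]
    return xs, ts, ys

xs, ts, ys = simulate(20000)
g0, g1 = fit_modified_covariates(xs, ts, ys)
# g0 + g1 * x estimates the treatment effect E[Y | T=1, x] - E[Y | T=-1, x].
print(g0, g1)
```

Because T is randomized independently of x in this toy setup, the modified covariates are uncorrelated with the main effects, so the fit can target the interaction coefficients (here 1 and 2) without modeling the main effects at all.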
-
Quantitative SD-OCT imaging biomarkers as indicators of age-related macular degeneration progression.
Investigative ophthalmology & visual science
2014; 55 (11): 7093-7103
Abstract
Purpose: We developed a statistical model based on quantitative characteristics of drusen to estimate the likelihood of conversion from early and intermediate age-related macular degeneration (AMD) to its advanced exudative form (AMD progression) in the short term (less than 5 years), a crucial task to enable early intervention and improve outcomes. Methods: Image features of drusen quantifying their number, morphology, and reflectivity properties, as well as the longitudinal evolution in these characteristics, were automatically extracted from 2146 spectral domain optical coherence tomography (SD-OCT) scans of 330 AMD eyes in 244 patients collected over a period of 5 years, with 36 eyes showing progression during clinical follow-up. We developed and evaluated a statistical model to predict the likelihood of progression at pre-determined times using clinical and image features as predictors. Results: Area, volume, height, and reflectivity of drusen were informative features distinguishing between progressing and non-progressing cases. Discerning progression at follow-up (mean 6.16 months) resulted in a mean area under the receiver operating characteristic curve (AUC) of 0.74 ((0.58, 0.85) 95% confidence interval (CI)). The maximum predictive performance was observed at 11 months after a patient's first early AMD diagnosis, with mean AUC 0.92 ((0.83, 0.98) 95% CI). Those eyes predicted to progress showed a much higher progression rate than those predicted not to progress at any given time from the initial visit. Conclusions: Our results demonstrate the potential ability of our model to identify those AMD patients at risk of progressing to exudative AMD from an early or intermediate stage.
View details for DOI 10.1167/iovs.14-14918
View details for PubMedID 25301882
-
Alteration of the lipid profile in lymphomas induced by MYC overexpression.
Proceedings of the National Academy of Sciences of the United States of America
2014; 111 (29): 10450-10455
Abstract
Overexpression of the v-myc avian myelocytomatosis viral oncogene homolog (MYC) oncogene is one of the most commonly implicated causes of human tumorigenesis. MYC is known to regulate many aspects of cellular biology including glucose and glutamine metabolism. Little is known about the relationship between MYC and the appearance and disappearance of specific lipid species. We use desorption electrospray ionization mass spectrometry imaging (DESI-MSI), statistical analysis, and conditional transgenic animal models and cell samples to investigate changes in lipid profiles in MYC-induced lymphoma. We have detected a lipid signature distinct from that observed in normal tissue and in rat sarcoma-induced lymphoma cells. We found 104 distinct molecular ions that have an altered abundance in MYC lymphoma compared with normal control tissue by statistical analysis with a false discovery rate of less than 5%. Of these, 86 molecular ions were specifically identified as complex phospholipids. To evaluate whether the lipid signature could also be observed in human tissue, we examined 15 human lymphoma samples with varying expression levels of MYC oncoprotein. Distinct lipid profiles in lymphomas with high and low MYC expression were observed, including many of the lipid species identified as significant for MYC-induced animal lymphoma tissue. Our results suggest a relationship between the appearance of specific lipid species and the overexpression of MYC in lymphomas.
View details for DOI 10.1073/pnas.1409778111
View details for PubMedID 24994904
-
Automated identification of stratifying signatures in cellular subpopulations
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2014; 111 (26): E2770-E2777
Abstract
Elucidation and examination of cellular subpopulations that display condition-specific behavior can play a critical contributory role in understanding disease mechanism, as well as provide a focal point for development of diagnostic criteria linking such a mechanism to clinical prognosis. Despite recent advancements in single-cell measurement technologies, the identification of relevant cell subsets through manual efforts remains standard practice. As new technologies such as mass cytometry increase the parameterization of single-cell measurements, the scalability and subjectivity inherent in manual analyses slow both analysis and progress. We therefore developed Citrus (cluster identification, characterization, and regression), a data-driven approach for the identification of stratifying subpopulations in multidimensional cytometry datasets. The methodology of Citrus is demonstrated through the identification of known and unexpected pathway responses in a dataset of stimulated peripheral blood mononuclear cells measured by mass cytometry. Additionally, the performance of Citrus is compared with that of existing methods through the analysis of several publicly available datasets. As the complexity of flow cytometry datasets continues to increase, methods such as Citrus will be needed to aid investigators in the performance of unbiased (and potentially more thorough) correlation-based mining and inspection of cell subsets nested within high-dimensional datasets.
View details for DOI 10.1073/pnas.1408792111
View details for Web of Science ID 000338118900020
View details for PubMedID 24979804
View details for PubMedCentralID PMC4084463
-
Active idiotypic vaccination versus control immunotherapy for follicular lymphoma.
Journal of clinical oncology
2014; 32 (17): 1797-1803
Abstract
Idiotypes (Ids), the unique portions of tumor immunoglobulins, can serve as targets for passive and active immunotherapies for lymphoma. We performed a multicenter, randomized trial comparing a specific vaccine (MyVax), comprising Id chemically coupled to keyhole limpet hemocyanin (KLH) plus granulocyte macrophage colony-stimulating factor (GM-CSF) to a control immunotherapy with KLH plus GM-CSF. Patients with previously untreated advanced-stage follicular lymphoma (FL) received eight cycles of chemotherapy with cyclophosphamide, vincristine, and prednisone. Those achieving sustained partial or complete remission (n=287 [44%]) were randomly assigned at a ratio of 2:1 to receive one injection per month for 7 months of MyVax or control immunotherapy. Anti-Id antibody responses (humoral immune responses [IRs]) were measured before each immunization. The primary end point was progression-free survival (PFS). Secondary end points included IR and time to subsequent antilymphoma therapy. At a median follow-up of 58 months, no significant difference was observed in either PFS or time to next therapy between the two arms. In the MyVax group (n=195), anti-Id IRs were observed in 41% of patients, with a median PFS of 40 months, significantly exceeding the median PFS observed in patients without such Id-induced IRs and in those receiving control immunotherapy. This trial failed to demonstrate clinical benefit of specific immunotherapy. The subset of vaccinated patients mounting specific anti-Id responses had superior outcomes. Whether this reflects a therapeutic benefit or is a marker for more favorable underlying prognosis requires further study.
View details for DOI 10.1200/JCO.2012.43.9273
View details for PubMedID 24799467
-
LMO2 and BCL6 are associated with improved survival in primary central nervous system lymphoma
BRITISH JOURNAL OF HAEMATOLOGY
2014; 165 (5): 640-648
Abstract
Primary central nervous system lymphoma (PCNSL) is an aggressive sub-variant of non-Hodgkin lymphoma (NHL) with morphological similarities to diffuse large B-cell lymphoma (DLBCL). While methotrexate (MTX)-based therapies have improved patient survival, the disease remains incurable in most cases and its pathogenesis is poorly understood. We evaluated 69 cases of PCNSL for the expression of HGAL (also known as GCSAM), LMO2 and BCL6, genes associated with DLBCL prognosis and pathobiology, and analysed their correlation to survival in 49 PCNSL patients receiving MTX-based therapy. We demonstrate that PCNSL expresses LMO2, HGAL (also known as GCSAM) and BCL6 proteins in 52%, 65% and 56% of tumours, respectively. BCL6 protein expression was associated with longer progression-free survival (P = 0·006) and overall survival (OS, P = 0·05), while expression of LMO2 protein was associated with longer OS (P = 0·027). Further research is needed to elucidate the function of BCL6 and LMO2 in PCNSL.
View details for DOI 10.1111/bjh.12801
View details for Web of Science ID 000335826500008
View details for PubMedID 24571259
View details for PubMedCentralID PMC4123533
-
Sensitivity analysis for inference with partially identifiable covariance matrices
COMPUTATIONAL STATISTICS
2014; 29 (3-4): 529-546
View details for DOI 10.1007/s00180-013-0451-4
View details for Web of Science ID 000336813100008
-
Regularization Paths for Conditional Logistic Regression: The clogitL1 Package
JOURNAL OF STATISTICAL SOFTWARE
2014; 58 (12): 1-23
View details for Web of Science ID 000341642900001
-
A multicentre study of primary breast diffuse large B-cell lymphoma in the rituximab era
BRITISH JOURNAL OF HAEMATOLOGY
2014; 165 (3): 358-363
Abstract
Primary breast diffuse large B-cell lymphoma (DLBCL) is a rare subtype of non-Hodgkin lymphoma (NHL) with limited data on pathology and outcome. A multicentre retrospective study was undertaken to determine prognostic factors and the incidence of central nervous system (CNS) relapses. Data was retrospectively collected on patients from 8 US academic centres. Only patients with stage I/II disease (involvement of breast and localized lymph nodes) were included. Histologies apart from primary DLBCL were excluded. Between 1992 and 2012, 76 patients met the eligibility criteria. Most patients (86%) received chemotherapy, and 69% received immunochemotherapy with rituximab; 65% received radiation therapy and 9% received prophylactic CNS chemotherapy. After a median follow-up of 4·5 years (range 0·6-20·6 years), the Kaplan-Meier estimated median progression-free survival was 10·4 years (95% confidence interval [CI] 5·8-14·9 years), and the median overall survival was 14·6 years (95% CI 10·2-19 years). Twelve patients (16%) had CNS relapse. A low stage-modified International Prognostic Index (IPI) was associated with longer overall survival. Rituximab use was not associated with a survival advantage. Primary breast DLBCL has a high rate of CNS relapse. The stage-modified IPI score is associated with survival.
View details for DOI 10.1111/bjh.12753
View details for Web of Science ID 000334031000011
View details for PubMedID 24467658
View details for PubMedCentralID PMC3990235
-
A SIGNIFICANCE TEST FOR THE LASSO.
Annals of statistics
2014; 42 (2): 413-468
Abstract
In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test statistic, and show that when the true model is linear, this statistic has an Exp(1) asymptotic distribution under the null hypothesis (the null being that all truly active variables are contained in the current lasso model). Our proof of this result for the special case of the first predictor to enter the model (i.e., testing for a single significant predictor variable against the global null) requires only weak assumptions on the predictor matrix X. On the other hand, our proof for a general step in the lasso path places further technical assumptions on X and the generative model, but still allows for the important high-dimensional case p > n, and does not necessarily require that the current lasso model achieves perfect recovery of the truly active variables. Of course, for testing the significance of an additional variable between two nested linear models, one typically uses the chi-squared test, comparing the drop in residual sum of squares (RSS) to a $\chi^2_1$ distribution. But when this additional variable is not fixed, and has been chosen adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS stochastically much larger than $\chi^2_1$ under the null hypothesis. Our analysis explicitly accounts for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the tuning parameter λ decreases. In this analysis, shrinkage plays a key role: though additional variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to the $\ell_1$ penalty. Therefore, the test statistic (which is based on lasso fitted values) is in a sense balanced by these two opposing properties (adaptivity and shrinkage), and its null distribution is tractable and asymptotically Exp(1).
View details for DOI 10.1214/13-AOS1175
View details for Web of Science ID 000336888400001
View details for PubMedID 25574062
View details for PubMedCentralID PMC4285373
-
Molecular assessment of surgical-resection margins of gastric cancer by mass-spectrometric imaging.
Proceedings of the National Academy of Sciences of the United States of America
2014; 111 (7): 2436-2441
Abstract
Surgical resection is the main curative option for gastrointestinal cancers. The extent of cancer resection is commonly assessed during surgery by pathologic evaluation of (frozen sections of) the tissue at the resected specimen margin(s) to verify whether cancer is present. We compare this method to an alternative procedure, desorption electrospray ionization mass spectrometric imaging (DESI-MSI), for 62 banked human cancerous and normal gastric-tissue samples. In DESI-MSI, microdroplets strike the tissue sample, the resulting splash enters a mass spectrometer, and a statistical analysis, here, the Lasso method (which stands for least absolute shrinkage and selection operator and which is a multiclass logistic regression with L1 penalty), is applied to classify tissues based on the molecular information obtained directly from DESI-MSI. The methodology developed with 28 frozen training samples of clear histopathologic diagnosis showed an overall accuracy value of 98% for the 12,480 pixels evaluated in cross-validation (CV), and 97% when a completely independent set of samples was tested. By applying an additional spatial smoothing technique, the accuracy for both CV and the independent set of samples was 99% compared with histological diagnoses. To test our method for clinical use, we applied it to a total of 21 tissue-margin samples prospectively obtained from nine gastric-cancer patients. The results obtained suggest that DESI-MSI/Lasso may be valuable for routine intraoperative assessment of the specimen margins during gastric-cancer surgery.
View details for DOI 10.1073/pnas.1400274111
View details for PubMedID 24550265
-
Systems analysis of sex differences reveals an immunosuppressive role for testosterone in the response to influenza vaccination.
Proceedings of the National Academy of Sciences of the United States of America
2014; 111 (2): 869-874
Abstract
Females have generally more robust immune responses than males for reasons that are not well-understood. Here we used a systems analysis to investigate these differences by analyzing the neutralizing antibody response to a trivalent inactivated seasonal influenza vaccine (TIV) and a large number of immune system components, including serum cytokines and chemokines, blood cell subset frequencies, genome-wide gene expression, and cellular responses to diverse in vitro stimuli, in 53 females and 34 males of different ages. We found elevated antibody responses to TIV and expression of inflammatory cytokines in the serum of females compared with males regardless of age. This inflammatory profile correlated with the levels of phosphorylated STAT3 proteins in monocytes but not with the serological response to the vaccine. In contrast, using a machine learning approach, we identified a cluster of genes involved in lipid biosynthesis and previously shown to be up-regulated by testosterone that correlated with poor virus-neutralizing activity in men. Moreover, men with elevated serum testosterone levels and associated gene signatures exhibited the lowest antibody responses to TIV. These results demonstrate a strong association between androgens and genes involved in lipid metabolism, suggesting that these could be important drivers of the differences in immune responses between males and females.
View details for DOI 10.1073/pnas.1321060111
View details for PubMedID 24367114
View details for PubMedCentralID PMC3896147
-
Increasing value and reducing waste in research design, conduct, and analysis.
Lancet
2014; 383 (9912): 166-175
Abstract
Correctable weaknesses in the design, conduct, and analysis of biomedical and public health research studies can produce misleading results and waste valuable resources. Small effects can be difficult to distinguish from bias introduced by study design and analyses. An absence of detailed written protocols and poor documentation of research is common. Information obtained might not be useful or important, and statistical precision or power is often too low or used in a misleading way. Insufficient consideration might be given to both previous and continuing studies. Arbitrary choice of analyses and an overemphasis on random extremes might affect the reported findings. Several problems relate to the research workforce, including failure to involve experienced statisticians and methodologists, failure to train clinical researchers and laboratory scientists in research methods and design, and the involvement of stakeholders with conflicts of interest. Inadequate emphasis is placed on recording of research decisions and on reproducibility of research. Finally, reward systems incentivise quantity more than quality, and novelty more than reliability. We propose potential solutions for these problems, including improvements in protocols and documentation, consideration of evidence from studies in progress, standardisation of research efforts, optimisation and training of an experienced and non-conflicted scientific workforce, and reconsideration of scientific reward systems.
View details for DOI 10.1016/S0140-6736(13)62227-8
View details for PubMedID 24411645
-
A shared transcriptional program in early breast neoplasias despite genetic and clinical distinctions
GENOME BIOLOGY
2014; 15 (5)
Abstract
The earliest recognizable stages of breast neoplasia are lesions that represent a heterogeneous collection of epithelial proliferations currently classified based on morphology. Their role in the development of breast cancer is not well understood but insight into the critical events at this early stage will improve efforts in breast cancer detection and prevention. These microscopic lesions are technically difficult to study so very little is known about their molecular alterations. To characterize the transcriptional changes of early breast neoplasia, we sequenced 3'-end enriched RNAseq libraries from formalin-fixed paraffin-embedded tissue of early neoplasia samples and matched normal breast and carcinoma samples from 25 patients. We find that gene expression patterns within early neoplasias are distinct from both normal and breast cancer patterns and identify a pattern of pro-oncogenic changes, including elevated transcription of ERBB2, FOXA1, and GATA3 at this early stage. We validate these findings on a second independent gene expression profile data set generated by whole transcriptome sequencing. Measurements of protein expression by immunohistochemistry on an independent set of early neoplasias confirm that ER pathway regulators FOXA1 and GATA3, as well as ER itself, are consistently upregulated at this early stage. The early neoplasia samples also demonstrate coordinated changes in long non-coding RNA expression and microenvironment stromal gene expression patterns. This study is the first examination of global gene expression in early breast neoplasia, and the genes identified here represent candidate participants in the earliest molecular events in the development of breast cancer.
View details for DOI 10.1186/gb-2014-15-5-r71
View details for Web of Science ID 000338981700005
View details for PubMedCentralID PMC4072957
-
Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-Seq data
STATISTICAL METHODS IN MEDICAL RESEARCH
2013; 22 (5): 519-536
Abstract
We discuss the identification of features that are associated with an outcome in RNA-Sequencing (RNA-Seq) and other sequencing-based comparative genomic experiments. RNA-Seq data takes the form of counts, so models based on the normal distribution are generally unsuitable. The problem is especially challenging because different sequencing experiments may generate quite different total numbers of reads, or 'sequencing depths'. Existing methods for this problem are based on Poisson or negative binomial models: they are useful but can be heavily influenced by 'outliers' in the data. We introduce a simple, non-parametric method with resampling to account for the different sequencing depths. The new method is more robust than parametric methods. It can be applied to data with quantitative, survival, two-class or multiple-class outcomes. We compare our proposed method to Poisson and negative binomial-based methods in simulated and real data sets, and find that our method discovers more consistent patterns than competing methods.
View details for DOI 10.1177/0962280211428386
View details for Web of Science ID 000325863700005
View details for PubMedID 22127579
View details for PubMedCentralID PMC4605138
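The idea of pairing a rank-based statistic with resampling to equalize sequencing depth can be sketched for a two-class outcome as follows. The abstract does not spell out the resampling scheme, so this sketch makes its own choice: binomial thinning of every sample down to the smallest observed depth, followed by a Wilcoxon rank-sum statistic averaged over resamples. The function names and the toy counts are hypothetical.

```python
import random

def thin_to_common_depth(counts, rng):
    """counts[i][g] = reads for sample i, gene g. Keep each read with
    probability min_depth / depth_i (binomial thinning; an assumed scheme)."""
    depths = [sum(row) for row in counts]
    target = min(depths)
    thinned = []
    for row, depth in zip(counts, depths):
        keep = target / depth
        thinned.append([sum(rng.random() < keep for _ in range(c)) for c in row])
    return thinned

def rank_sum(values, labels):
    """Wilcoxon rank-sum statistic for class 1, using mid-ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1              # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return sum(r for r, lab in zip(ranks, labels) if lab == 1)

def resampled_rank_sums(counts, labels, n_resamples=20, seed=0):
    """Average the per-gene rank-sum statistic over depth-equalized resamples."""
    rng = random.Random(seed)
    n_genes = len(counts[0])
    totals = [0.0] * n_genes
    for _ in range(n_resamples):
        thinned = thin_to_common_depth(counts, rng)
        for g in range(n_genes):
            totals[g] += rank_sum([row[g] for row in thinned], labels)
    return [t / n_resamples for t in totals]

# Toy data: class 1 samples are sequenced twice as deeply as class 0 samples.
labels = [0, 0, 0, 1, 1, 1]
counts = [
    [10, 50, 940], [8, 55, 937], [12, 45, 943],            # depth 1000 each
    [400, 100, 1500], [380, 110, 1510], [420, 90, 1490],   # depth 2000 each
]
raw_gene1 = rank_sum([row[1] for row in counts], labels)
stats = resampled_rank_sums(counts, labels)
print("raw rank sum, gene 1:", raw_gene1)
print("depth-adjusted rank sums:", stats)
```

In the toy data, gene 1 looks maximally different between classes on raw counts only because of depth, and the resampled statistic moves back toward its null value. Gene 0 is up-regulated relative to depth and keeps the maximal rank sum.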
-
Identification of gene microarray expression profiles in patients with chronic graft-versus-host disease following allogeneic hematopoietic cell transplantation.
Clinical immunology
2013; 148 (1): 124-135
Abstract
Chronic graft-versus-host disease (GVHD) results in significant morbidity and mortality, limiting the benefit of allogeneic hematopoietic cell transplantation (HCT). Peripheral blood gene expression profiling of the donor immune repertoire following HCT may provide associated genes and pathways, thereby improving the pathophysiologic understanding of chronic GVHD. We profiled 70 patients and identified candidate genes that provided mechanistic insight into the biologic pathways that underlie chronic GVHD. Our data revealed that the dominant gene signature in patients with chronic GVHD represented compensatory responses that control inflammation and included the interleukin-1 decoy receptor, IL-1 receptor type II, and genes that were profibrotic and associated with the IL-4, IL-6 and IL-10 signaling pathways. In addition, we identified three genes that were important regulators of extracellular matrix. Validation of this discovery phase study will determine if the identified genes have diagnostic, prognostic or therapeutic implications.
View details for DOI 10.1016/j.clim.2013.04.013
View details for PubMedID 23685278
-
A LASSO FOR HIERARCHICAL INTERACTIONS
ANNALS OF STATISTICS
2013; 41 (3): 1111-1141
View details for DOI 10.1214/13-AOS1096
View details for Web of Science ID 000321847600003
-
A LASSO FOR HIERARCHICAL INTERACTIONS.
Annals of statistics
2013; 41 (3): 1111-1141
Abstract
We add a set of convex constraints to the lasso to produce sparse interaction models that honor the hierarchy restriction that an interaction only be included in a model if one or both variables are marginally important. We give a precise characterization of the effect of this hierarchy constraint, prove that hierarchy holds with probability one and derive an unbiased estimate for the degrees of freedom of our estimator. A bound on this estimate reveals the amount of fitting "saved" by the hierarchy constraint. We distinguish between parameter sparsity (the number of nonzero coefficients) and practical sparsity (the number of raw variables one must measure to make a new prediction). Hierarchy focuses on the latter, which is more closely tied to important data collection concerns such as cost, time and effort. We develop an algorithm, available in the R package hierNet, and perform an empirical study of our method.
View details for DOI 10.1214/13-AOS1096
View details for PubMedID 26257447
View details for PubMedCentralID PMC4527358
-
A Sparse-Group Lasso
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2013; 22 (2): 231-245
View details for DOI 10.1080/10618600.2012.681250
View details for Web of Science ID 000319954000001
-
Classification of patients from time-course gene expression
BIOSTATISTICS
2013; 14 (1): 87-98
Abstract
Classifying patients into different risk groups based on their genomic measurements can help clinicians design appropriate clinical treatment plans. To produce such a classification, gene expression data were collected on a cohort of burn patients, who were monitored across multiple time points. This led us to develop a new classification method using time-course gene expression. Our results showed that making good use of the time-course information improved classification performance compared with using gene expression from individual time points only. Our method is implemented in an R package, time-course prediction analysis using microarray.
View details for DOI 10.1093/biostatistics/kxs027
View details for Web of Science ID 000312636300007
View details for PubMedID 22926914
View details for PubMedCentralID PMC3520502
-
Scientific research in the age of omics: the good, the bad, and the sloppy
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2013; 20 (1): 125-127
Abstract
It has been claimed that most research findings are false, and it is known that large-scale studies involving omics data are especially prone to errors in design, execution, and analysis. The situation is alarming because taxpayer dollars fund a substantial amount of biomedical research, and because the publication of a research article that is later determined to be flawed can erode the credibility of an entire field, resulting in a severe and negative impact for years to come. Here, we urge the development of an online, open-access, postpublication, peer review system that will increase the accountability of scientists for the quality of their research and the ability of readers to distinguish good from sloppy science.
View details for DOI 10.1136/amiajnl-2012-000972
View details for Web of Science ID 000313512900020
View details for PubMedID 23037799
-
Coronary risk assessment among intermediate risk patients using a clinical and biomarker based algorithm developed and validated in two population cohorts
CURRENT MEDICAL RESEARCH AND OPINION
2012; 28 (11): 1819-1830
Abstract
Many coronary heart disease (CHD) events occur in individuals classified as intermediate risk by commonly used assessment tools. Over half the individuals presenting with a severe cardiac event, such as myocardial infarction (MI), have at most one risk factor as included in the widely used Framingham risk assessment. Individuals classified as intermediate risk, who are actually at high risk, may not receive guideline recommended treatments. A clinically useful method for accurately predicting 5-year CHD risk among intermediate risk patients remains an unmet medical need. This study sought to develop a CHD Risk Assessment (CHDRA) model that improves 5-year risk stratification among intermediate risk individuals. Assay panels for biomarkers associated with atherosclerosis biology (inflammation, angiogenesis, apoptosis, chemotaxis, etc.) were optimized for measuring baseline serum samples from 1084 initially CHD-free Marshfield Clinic Personalized Medicine Research Project (PMRP) individuals. A multivariable Cox regression model was fit using the most powerful risk predictors within the clinical and protein variables identified by repeated cross-validation. The resulting CHDRA algorithm was validated in a Multiple-Ethnic Study of Atherosclerosis (MESA) case-cohort sample. A CHDRA algorithm of age, sex, diabetes, and family history of MI, combined with serum levels of seven biomarkers (CTACK, Eotaxin, Fas Ligand, HGF, IL-16, MCP-3, and sFas) yielded a clinical net reclassification index of 42.7% (p < 0.001) for MESA patients with a recalibrated Framingham 5-year intermediate risk level. Across all patients, the model predicted acute coronary events (hazard ratio = 2.17, p < 0.001), and remained an independent predictor after Framingham risk factor adjustments. Limitations include the slightly different event definition with the MESA samples and the inability to include PMRP fatal CHD events. A novel risk score of serum protein levels plus clinical risk factors, developed and validated in independent cohorts, demonstrated clinical utility for assessing the true risk of CHD events in intermediate risk patients. Improved accuracy in cardiovascular risk classification could lead to improved preventive care and fewer deaths.
View details for DOI 10.1185/03007995.2012.742878
View details for Web of Science ID 000310985600009
View details for PubMedID 23092312
View details for PubMedCentralID PMC3666558
-
Genome-wide Measurement of RNA Folding Energies
MOLECULAR CELL
2012; 48 (2): 169-181
Abstract
RNA structural transitions are important in the function and regulation of RNAs. Here, we reveal a layer of transcriptome organization in the form of RNA folding energies. By probing yeast RNA structures at different temperatures, we obtained relative melting temperatures (Tm) for RNA structures in over 4000 transcripts. Specific signatures of RNA Tm demarcated the polarity of mRNA open reading frames and highlighted numerous candidate regulatory RNA motifs in 3' untranslated regions. RNA Tm distinguished noncoding versus coding RNAs and identified mRNAs with distinct cellular functions. We identified thousands of putative RNA thermometers, and their presence is predictive of the pattern of RNA decay in vivo during heat shock. The exosome complex recognizes unpaired bases during heat shock to degrade these RNAs, coupling intrinsic structural stabilities to gene regulation. Thus, genome-wide structural dynamics of RNA can parse functional elements of the transcriptome and reveal diverse biological insights.
View details for DOI 10.1016/j.molcel.2012.08.008
View details for PubMedID 22981864
-
Inference with Transposable Data: Modeling the Effects of Row and Column Correlations.
Journal of the Royal Statistical Society. Series B, Statistical methodology
2012; 74 (4): 721-743
Abstract
We consider the problem of large-scale inference on the row or column variables of data in the form of a matrix. Many of these data matrices are transposable meaning that neither the row variables nor the column variables can be considered independent instances. An example of this scenario is detecting significant genes in microarrays when the samples may be dependent due to latent variables or unknown batch effects. By modeling this matrix data using the matrix-variate normal distribution, we study and quantify the effects of row and column correlations on procedures for large-scale inference. We then propose a simple solution to the myriad of problems presented by unanticipated correlations: We simultaneously estimate row and column covariances and use these to sphere or de-correlate the noise in the underlying data before conducting inference. This procedure yields data with approximately independent rows and columns so that test statistics more closely follow null distributions and multiple testing procedures correctly control the desired error rates. Results on simulated models and real microarray data demonstrate major advantages of this approach: (1) increased statistical power, (2) less bias in estimating the false discovery rate, and (3) reduced variance of the false discovery rate estimators.
View details for DOI 10.1111/j.1467-9868.2011.01027.x
View details for PubMedID 34880705
View details for PubMedCentralID PMC8649963
-
Normalization, testing, and false discovery rate estimation for RNA-sequencing data
BIOSTATISTICS
2012; 13 (3): 523-538
Abstract
We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.
View details for DOI 10.1093/biostatistics/kxr031
View details for Web of Science ID 000305420000012
View details for PubMedID 22003245
View details for PubMedCentralID PMC3372940
-
STANDARDIZATION AND THE GROUP LASSO PENALTY.
Statistica Sinica
2012; 22 (3): 983-1001
Abstract
We re-examine the original Group Lasso paper of Yuan and Lin (2007). The form of penalty in that paper seems to be designed for problems with uncorrelated features, but the statistical community has adopted it for general problems with correlated features. We show that for this general situation, a Group Lasso with a different choice of penalty matrix is generally more effective. We give insight into this formulation and show that it is intimately related to the uniformly most powerful invariant test for inclusion of a group. We demonstrate the efficacy of this method, the "standardized Group Lasso", over the usual group lasso on real and simulated data sets. We also extend this to the Ridged Group Lasso to provide within group regularization as needed. We discuss a simple algorithm based on group-wise coordinate descent to fit both this standardized Group Lasso and Ridged Group Lasso.
View details for DOI 10.5705/ss.2011.075
View details for PubMedID 26257503
View details for PubMedCentralID PMC4527185
-
STANDARDIZATION AND THE GROUP LASSO PENALTY
STATISTICA SINICA
2012; 22 (3): 983-1001
View details for DOI 10.5705/ss.2011.075
View details for Web of Science ID 000307910300004
-
Autoantibody Epitope Spreading in the Pre-Clinical Phase Predicts Progression to Rheumatoid Arthritis
PLOS ONE
2012; 7 (5)
Abstract
Rheumatoid arthritis (RA) is a prototypical autoimmune arthritis affecting nearly 1% of the world population and is a significant cause of worldwide disability. Though prior studies have demonstrated the appearance of RA-related autoantibodies years before the onset of clinical RA, the pattern of immunologic events preceding the development of RA remains unclear. To characterize the evolution of the autoantibody response in the preclinical phase of RA, we used a novel multiplex autoantigen array to evaluate development of the anti-citrullinated protein antibodies (ACPA) and to determine if epitope spread correlates with rise in serum cytokines and imminent onset of clinical RA. To do so, we utilized a cohort of 81 patients with clinical RA for whom stored serum was available from 1-12 years prior to disease onset. We evaluated the accumulation of ACPA subtypes over time and correlated this accumulation with elevations in serum cytokines. We then used logistic regression to identify a profile of biomarkers which predicts the imminent onset of clinical RA (defined as within 2 years of testing). We observed a time-dependent expansion of ACPA specificity with the number of ACPA subtypes. At the earliest timepoints, we found autoantibodies targeting several innate immune ligands including citrullinated histones, fibrinogen, and biglycan, thus providing insights into the earliest autoantigen targets and potential mechanisms underlying the onset and development of autoimmunity in RA. Additionally, expansion of the ACPA response strongly predicted elevations in many inflammatory cytokines including TNF-α, IL-6, IL-12p70, and IFN-γ. Thus, we observe that the preclinical phase of RA is characterized by an accumulation of multiple autoantibody specificities reflecting the process of epitope spread. 
Epitope expansion is closely correlated with the appearance of preclinical inflammation, and we identify a biomarker profile including autoantibodies and cytokines which predicts the imminent onset of clinical arthritis.
View details for DOI 10.1371/journal.pone.0035296
View details for PubMedID 22662108
-
DEGREES OF FREEDOM IN LASSO PROBLEMS
ANNALS OF STATISTICS
2012; 40 (2): 1198-1232
View details for DOI 10.1214/12-AOS1003
View details for Web of Science ID 000307608000021
-
Strong rules for discarding predictors in lasso-type problems.
Journal of the Royal Statistical Society. Series B, Statistical methodology
2012; 74 (2): 245-266
Abstract
We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui and his colleagues have proposed 'SAFE' rules, based on univariate inner products between each predictor and the outcome, which guarantee that a coefficient will be 0 in the solution vector. This provides a reduction in the number of variables that need to be entered into the optimization. We propose strong rules that are very simple and yet screen out far more predictors than the SAFE rules. This great practical improvement comes at a price: the strong rules are not foolproof and can mistakenly discard active predictors, i.e. predictors that have non-zero coefficients in the solution. We therefore combine them with simple checks of the Karush-Kuhn-Tucker conditions to ensure that the exact solution to the convex problem is delivered. Of course, any (approximate) screening method can be combined with the Karush-Kuhn-Tucker conditions to ensure the exact solution; the strength of the strong rules lies in the fact that, in practice, they discard a very large number of the inactive predictors and almost never commit mistakes. We also derive conditions under which they are foolproof. Strong rules provide substantial savings in computational time for a variety of statistical optimization problems.
View details for DOI 10.1111/j.1467-9868.2011.01004.x
View details for PubMedID 25506256
View details for PubMedCentralID PMC4262615
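The screen-then-check procedure described in this abstract can be sketched in a few dozen lines. This is a minimal illustration, not the authors' implementation. It assumes the objective 0.5·||y − Xβ||² + λ·||β||₁, unit-norm predictor columns, the basic (non-sequential) strong rule that discards predictor j when |x_jᵀy| < 2λ − λ_max, and the basic SAFE rule that discards j when |x_jᵀy| < λ − ||x_j||·||y||·(λ_max − λ)/λ_max. The solver is plain coordinate descent, and all function names and the toy data are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft(z, t):
    """Soft-thresholding operator."""
    return z - t if z > t else z + t if z < -t else 0.0

def lasso_cd(cols, y, lam, active, sweeps=5000, tol=1e-12):
    """Coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1 over the
    `active` columns only; assumes every column has unit Euclidean norm."""
    beta = [0.0] * len(cols)
    resid = list(y)
    for _ in range(sweeps):
        biggest = 0.0
        for j in active:
            new = soft(dot(cols[j], resid) + beta[j], lam)
            step = new - beta[j]
            if step != 0.0:
                resid = [r - step * x for r, x in zip(resid, cols[j])]
                beta[j] = new
                biggest = max(biggest, abs(step))
        if biggest < tol:
            break
    return beta, resid

def strong_discards(cols, y, lam):
    scores = [abs(dot(c, y)) for c in cols]
    lam_max = max(scores)
    # Basic strong rule: discard j when |x_j^T y| < 2*lam - lam_max.
    return [j for j, s in enumerate(scores) if s < 2 * lam - lam_max]

def safe_discards(cols, y, lam):
    scores = [abs(dot(c, y)) for c in cols]
    lam_max, y_norm = max(scores), dot(y, y) ** 0.5
    # Basic SAFE rule with unit-norm columns.
    return [j for j, s in enumerate(scores)
            if s < lam - y_norm * (lam_max - lam) / lam_max]

def fit_with_strong_rule(cols, y, lam):
    dropped = set(strong_discards(cols, y, lam))
    active = [j for j in range(len(cols)) if j not in dropped]
    while True:
        beta, resid = lasso_cd(cols, y, lam, active)
        # KKT check on discarded predictors: a zero coefficient needs |x_j^T r| <= lam.
        violators = [j for j in range(len(cols))
                     if j not in active and abs(dot(cols[j], resid)) > lam]
        if not violators:
            return beta, active
        active += violators   # repair the screen and refit

# Toy data: two real signals plus noise, columns scaled to unit norm.
rng = random.Random(0)
n, p = 40, 20
cols = []
for _ in range(p):
    c = [rng.gauss(0, 1) for _ in range(n)]
    norm = dot(c, c) ** 0.5
    cols.append([v / norm for v in c])
y = [3 * cols[0][i] - 2 * cols[1][i] + rng.gauss(0, 1) for i in range(n)]

lam = 0.8 * max(abs(dot(c, y)) for c in cols)
beta_screened, kept = fit_with_strong_rule(cols, y, lam)
beta_full, _ = lasso_cd(cols, y, lam, list(range(p)))
print("discarded by strong rule:", len(strong_discards(cols, y, lam)))
print("discarded by SAFE rule:  ", len(safe_discards(cols, y, lam)))
print("columns in the final fit:", len(kept), "of", p)
```

Because the KKT check restores any wrongly discarded predictor before the fit is accepted, the screened fit should agree with the unscreened one up to solver tolerance. With unit-norm columns the strong-rule threshold is never below the SAFE threshold, so the strong rule discards at least as many predictors.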
-
In situ vaccination against mycosis fungoides by intratumoral injection of a TLR9 agonist combined with radiation: a phase 1/2 study
BLOOD
2012; 119 (2): 355-363
Abstract
We have developed and previously reported on a therapeutic vaccination strategy for indolent B-cell lymphoma that combines local radiation to enhance tumor immunogenicity with the injection into the tumor of a TLR9 agonist. As a result, antitumor CD8(+) T cells are induced, and systemic tumor regression was documented. Because the vaccination occurs in situ, there is no need to manufacture a vaccine product. We have now explored this strategy in a second disease: mycosis fungoides (MF). We treated 15 patients. Clinical responses were assessed at the distant, untreated sites as a measure of systemic antitumor activity. Five clinically meaningful responses were observed. The procedure was well tolerated and adverse effects consisted mostly of mild and transient injection site or flu-like symptoms. The immunized sites showed a significant reduction of CD25(+), Foxp3(+) T cells that could be either MF cells or tissue regulatory T cells and a similar reduction in S100(+), CD1a(+) dendritic cells. There was a trend toward greater reduction of CD25(+) T cells and skin dendritic cells in clinical responders versus nonresponders. Our in situ vaccination strategy is feasible also in MF and the clinical responses that occurred in a subset of patients warrant further study with modifications to augment these therapeutic effects. This study is registered at www.clinicaltrials.gov as NCT00226993.
View details for DOI 10.1182/blood-2011-05-355222
View details for PubMedID 22045986
-
Transcriptional profiling of long non-coding RNAs and novel transcribed regions across a diverse panel of archived human cancers
GENOME BIOLOGY
2012; 13 (8)
Abstract
BACKGROUND: Molecular characterization of tumors has been critical for identifying important genes in cancer biology and for improving tumor classification and diagnosis. Long non-coding RNAs, as a new, relatively unstudied class of transcripts, provide a rich opportunity to identify both functional drivers and cancer-type-specific biomarkers. However, despite the potential importance of long non-coding RNAs to the cancer field, no comprehensive survey of long non-coding RNA expression across various cancers has been reported. RESULTS: We performed a sequencing-based transcriptional survey of both known long non-coding RNAs and novel intergenic transcripts across a panel of 64 archival tumor samples comprising 17 diagnostic subtypes of adenocarcinomas, squamous cell carcinomas and sarcomas. We identified hundreds of transcripts from among the known 1,065 long non-coding RNAs surveyed that showed variability in transcript levels between the tumor types and are therefore potential biomarker candidates. We discovered 1,071 novel intergenic transcribed regions and demonstrate that these show similar patterns of variability between tumor types. We found that many of these differentially expressed cancer transcripts are also expressed in normal tissues. One such novel transcript specifically expressed in breast tissue was further evaluated using RNA in situ hybridization on a panel of breast tumors. It was shown to correlate with low tumor grade and estrogen receptor expression, thereby representing a potentially important new breast cancer biomarker. CONCLUSIONS: This study provides the first large survey of long non-coding RNA expression within a panel of solid cancers and also identifies a number of novel transcribed regions differentially expressed across distinct cancer types that represent candidate biomarkers for future research.
View details for Web of Science ID 000315867500009
-
Inference with transposable data: modelling the effects of row and column correlations
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2012; 74: 721-743
View details for DOI 10.1111/j.1467-9868.2011.01027.x
View details for Web of Science ID 000307550300004
-
Strong rules for discarding predictors in lasso-type problems
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2012; 74: 245-266
Abstract
We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui and his colleagues have proposed 'SAFE' rules, based on univariate inner products between each predictor and the outcome, which guarantee that a coefficient will be 0 in the solution vector. This provides a reduction in the number of variables that need to be entered into the optimization. We propose strong rules that are very simple and yet screen out far more predictors than the SAFE rules. This great practical improvement comes at a price: the strong rules are not foolproof and can mistakenly discard active predictors, i.e. predictors that have non-zero coefficients in the solution. We therefore combine them with simple checks of the Karush-Kuhn-Tucker conditions to ensure that the exact solution to the convex problem is delivered. Of course, any (approximate) screening method can be combined with the Karush-Kuhn-Tucker conditions to ensure the exact solution; the strength of the strong rules lies in the fact that, in practice, they discard a very large number of the inactive predictors and almost never commit mistakes. We also derive conditions under which they are foolproof. Strong rules provide substantial savings in computational time for a variety of statistical optimization problems.
View details for DOI 10.1111/j.1467-9868.2011.01004.x
View details for Web of Science ID 000301286200004
View details for PubMedCentralID PMC4262615
-
Strong rules for discarding predictors in lasso-type problems
Journal of the Royal Statistical Society, Series B
2012; 74: 245-266
-
Sparse estimation of a covariance matrix
BIOMETRIKA
2011; 98 (4): 807-820
Abstract
We suggest a method for estimating a covariance matrix on the basis of a sample of vectors drawn from a multivariate normal distribution. In particular, we penalize the likelihood with a lasso penalty on the entries of the covariance matrix. This penalty plays two important roles: it reduces the effective number of parameters, which is important even when the dimension of the vectors is smaller than the sample size since the number of parameters grows quadratically in the number of variables, and it produces an estimate which is sparse. In contrast to sparse inverse covariance estimation, our method's close relative, the sparsity attained here is in the covariance matrix itself rather than in the inverse matrix. Zeros in the covariance matrix correspond to marginal independencies; thus, our method performs model selection while providing a positive definite estimate of the covariance. The proposed penalized maximum likelihood problem is not convex, so we use a majorize-minimize approach in which we iteratively solve convex approximations to the original nonconvex problem. We discuss tuning parameter selection and demonstrate on a flow-cytometry dataset how our method produces an interpretable graphical display of the relationship between variables. We perform simulations that suggest that simple elementwise thresholding of the empirical covariance matrix is competitive with our method for identifying the sparsity structure. Additionally, we show how our method can be used to solve a previously studied special case in which a desired sparsity pattern is prespecified.
View details for DOI 10.1093/biomet/asr054
View details for Web of Science ID 000297366000004
View details for PubMedCentralID PMC3413177
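The abstract above reports that simple elementwise thresholding of the empirical covariance matrix is a competitive way to recover the sparsity pattern. A minimal sketch of that baseline follows; it assumes hard thresholding of the off-diagonal entries with the diagonal left untouched, which the abstract does not specify, and the function names are hypothetical. It is not the penalized-likelihood method the paper proposes.

```python
def sample_covariance(rows):
    """Unbiased sample covariance of a list of observation vectors."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
             for k in range(p)]
            for j in range(p)]


def threshold_covariance(S, t):
    """Zero every off-diagonal entry whose absolute value is at most t.

    Assumption: hard thresholding, diagonal kept as is.
    """
    p = len(S)
    return [[S[j][k] if j == k or abs(S[j][k]) > t else 0.0
             for k in range(p)]
            for j in range(p)]
```

Unlike the estimator described in the abstract, a thresholded matrix of this kind is not guaranteed to be positive definite, which is one reason to treat it only as a tool for identifying the zero pattern.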
-
Sparse estimation of a covariance matrix.
Biometrika
2011; 98 (4): 807-820
View details for DOI 10.1093/biomet/asr054
View details for PubMedID 23049130
View details for PubMedCentralID PMC3413177
-
PROTOTYPE SELECTION FOR INTERPRETABLE CLASSIFICATION
ANNALS OF APPLIED STATISTICS
2011; 5 (4): 2403-2424
View details for DOI 10.1214/11-AOAS495
View details for Web of Science ID 000300382800008
-
A fused lasso latent feature model for analyzing multi-sample aCGH data
BIOSTATISTICS
2011; 12 (4): 776-791
Abstract
Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.
View details for DOI 10.1093/biostatistics/kxr012
View details for Web of Science ID 000294806800014
View details for PubMedID 21642389
-
Hierarchical Clustering With Prototypes via Minimax Linkage
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2011; 106 (495): 1075-1084
View details for DOI 10.1198/jasa.2011.tm10183
View details for Web of Science ID 000296224200033
-
Prediction of survival in diffuse large B-cell lymphoma based on the expression of 2 genes reflecting tumor and microenvironment
BLOOD
2011; 118 (5): 1350-1358
Abstract
Several gene-expression signatures predict survival in diffuse large B-cell lymphoma (DLBCL), but the lack of practical methods for genome-scale analysis has limited translation to clinical practice. We built and validated a simple model using one gene expressed by tumor cells and another expressed by host immune cells, assessing added prognostic value to the clinical International Prognostic Index (IPI). LIM domain only 2 (LMO2) was validated as an independent predictor of survival and the "germinal center B cell-like" subtype. Expression of tumor necrosis factor receptor superfamily member 9 (TNFRSF9) from the DLBCL microenvironment was the best gene in bivariate combination with LMO2. Study of TNFRSF9 tissue expression in 95 patients with DLBCL showed expression limited to infiltrating T cells. A model integrating these 2 genes was independent of "cell-of-origin" classification, "stromal signatures," IPI, and added to the predictive power of the IPI. A composite score integrating these genes with IPI performed well in 3 independent cohorts of 545 DLBCL patients, as well as in a simple assay of routine formalin-fixed specimens from a new validation cohort of 147 patients with DLBCL. We conclude that the measurement of a single gene expressed by tumor cells (LMO2) and a single gene expressed by the immune microenvironment (TNFRSF9) powerfully predicts overall survival in patients with DLBCL.
View details for DOI 10.1182/blood-2011-03-345272
View details for PubMedID 21670469
-
NOVEL CELL-TYPE SPECIFIC DECONVOLUTION OF WHOLE-BLOOD GENE EXPRESSION PROFILES IN RENAL ACUTE REJECTION
WILEY-BLACKWELL. 2011: 79–80
View details for Web of Science ID 000293251100159
-
MicroRNAs Are Independent Predictors of Outcome in Diffuse Large B-Cell Lymphoma Patients Treated with R-CHOP
CLINICAL CANCER RESEARCH
2011; 17 (12): 4125-4135
Abstract
Diffuse large B-cell lymphoma (DLBCL) heterogeneity has prompted investigations for new biomarkers that can accurately predict survival. A previously reported 6-gene model combined with the International Prognostic Index (IPI) could predict patients' outcome. However, even these predictors are not capable of unambiguously identifying outcome, suggesting that additional biomarkers might improve their predictive power. We studied expression of 11 microRNAs (miRNA) that had previously been reported to have variable expression in DLBCL tumors. We measured the expression of each miRNA by quantitative real-time PCR analyses in 176 samples from uniformly treated DLBCL patients and correlated the results to survival. In a univariate analysis, the expression of miR-18a correlated with overall survival (OS), whereas the expression of miR-181a and miR-222 correlated with progression-free survival (PFS). A multivariate Cox regression analysis including the IPI, the 6-gene model-derived mortality predictor score and expression of the miR-18a, miR-181a, and miR-222, revealed that all variables were independent predictors of survival except the expression of miR-222 for OS and the expression of miR-18a for PFS. The expression of specific miRNAs may be useful for DLBCL survival prediction and their role in the pathogenesis of this disease should be examined further.
View details for DOI 10.1158/1078-0432.CCR-11-0224
View details for Web of Science ID 000291644700029
View details for PubMedID 21525173
View details for PubMedCentralID PMC3117929
-
THE SOLUTION PATH OF THE GENERALIZED LASSO
ANNALS OF STATISTICS
2011; 39 (3): 1335-1371
View details for DOI 10.1214/11-AOS878
View details for Web of Science ID 000293716500001
-
Regularization Paths for Cox's Proportional Hazards Model via Coordinate Descent
JOURNAL OF STATISTICAL SOFTWARE
2011; 39 (5): 1-13
Abstract
We introduce a pathwise algorithm for the Cox proportional hazards model, regularized by convex combinations of ℓ1 and ℓ2 penalties (elastic net). Our algorithm fits via cyclical coordinate descent, and employs warm starts to find a solution along a regularization path. We demonstrate the efficacy of our algorithm on real and simulated data sets, and find considerable speedup between our algorithm and competing methods.
View details for Web of Science ID 000288204000001
View details for PubMedCentralID PMC4824408
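Cyclical coordinate descent with warm starts along a regularization path, as named in the abstract above, can be illustrated on a simpler loss. The sketch below swaps the Cox partial likelihood for squared-error loss, so it is not the algorithm of the paper; it assumes the objective (1/(2n))||y - Xb||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2), no intercept, and no all-zero columns. All function names are hypothetical.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam, alpha=1.0, beta=None, n_sweeps=200):
    """Cyclical coordinate descent for squared-error loss (illustration only).

    X is a list of rows; beta, if given, is used as a warm start.
    """
    n, p = len(X), len(X[0])
    beta = list(beta) if beta is not None else [0.0] * p
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    for _ in range(n_sweeps):
        for j in range(p):
            col_sq = sum(X[i][j] ** 2 for i in range(n)) / n
            # Inner product of column j with the partial residual
            # (the residual with coordinate j's contribution added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new = soft_threshold(rho, lam * alpha) / (col_sq + lam * (1 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def regularization_path(X, y, lambdas, alpha=1.0):
    """Solve from the largest lambda down, warm-starting each fit."""
    beta, path = None, []
    for lam in sorted(lambdas, reverse=True):
        beta = elastic_net_cd(X, y, lam, alpha, beta)
        path.append((lam, list(beta)))
    return path
```

Starting at the largest penalty and reusing each solution as the starting point for the next is what makes the whole path cheap to compute: neighbouring solutions are close, so each fit needs few sweeps.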
-
Human transcriptome array for high-throughput clinical studies
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2011; 108 (9): 3707-3712
Abstract
A 6.9 million-feature oligonucleotide array of the human transcriptome [Glue Grant human transcriptome (GG-H array)] has been developed for high-throughput and cost-effective analyses in clinical studies. This array allows comprehensive examination of gene expression and genome-wide identification of alternative splicing as well as detection of coding SNPs and noncoding transcripts. The performance of the array was examined and compared with mRNA sequencing (RNA-Seq) results over multiple independent replicates of liver and muscle samples. Compared with RNA-Seq of 46 million uniquely mappable reads per replicate, the GG-H array is highly reproducible in estimating gene and exon abundance. Although both platforms detect similar expression changes at the gene level, the GG-H array is more sensitive at the exon level. Deeper sequencing is required to adequately cover low-abundance transcripts. The array has been implemented in a multicenter clinical program and has generated high-quality, reproducible data. Considering the clinical trial requirements of cost, sample availability, and throughput, the GG-H array has a wide range of applications. An emerging approach for large-scale clinical genomic studies is to first use RNA-Seq to the sufficient depth for the discovery of transcriptome elements relevant to the disease process followed by high-throughput and reliable screening of these elements on thousands of patient samples using custom-designed arrays.
View details for DOI 10.1073/pnas.1019753108
View details for Web of Science ID 000287844400051
View details for PubMedID 21317363
View details for PubMedCentralID PMC3048146
-
The Prognostic Value of Tumor-Associated Macrophages in Leiomyosarcoma: A Single Institution Study
AMERICAN JOURNAL OF CLINICAL ONCOLOGY-CANCER CLINICAL TRIALS
2011; 34 (1): 82-86
Abstract
High numbers of tumor-associated macrophages (TAMs) have been associated with poor outcome in several solid tumors. In 2 previous studies, we showed that colony stimulating factor-1 (CSF1) is secreted by leiomyosarcoma (LMS) and that the increase in macrophages and CSF1 associated proteins are markers for poor prognosis in both gynecologic and nongynecologic LMS in a multicentered study. The purpose of this study is to evaluate the outcome of patients with LMS from a single institution according to the number of TAMs evaluated through 3 CSF1 associated proteins. Patients with LMS treated at Stanford University with adequate archived tissue and clinical data were eligible for this retrospective study. Data from chart reviews included tumor site, size, grade, stage, treatment, and disease status at the time of last follow-up. The 3 CSF1 associated proteins (CD163, CD16, and cathepsin L) were evaluated by immunohistochemistry on tissue microarrays. Kaplan-Meier survival curves and univariate Cox proportional hazards models were fit to assess the association of clinical predictors as well as CSF1 associated proteins with overall survival. A total of 52 patients diagnosed from 1983 to 2007 were evaluated. Univariate Cox proportional hazards models were fit to assess the significance of grade, size, stage, and the 3 CSF1 associated proteins in predicting OS. Grade, size, and stage were not significantly associated with survival in the full patient cohort, but grade and stage were significant predictors of survival in the gynecologic (GYN) LMS samples (P = 0.038 and P = 0.0164, respectively). Increased cathepsin L was associated with a worse outcome in GYN LMS (P = 0.049). Similar findings were seen with CD16 (P < 0.0001). In addition, CSF1 response enriched (all 3 stains positive) GYN LMS had a poor overall survival when compared with CSF1 response poor tumors (P = 0.001). These results were not seen in non-GYN LMS. Our data form an independent confirmation of the prognostic significance of TAMs and the CSF1 associated proteins in LMS. More aggressive or targeted therapies could be considered in the subset of LMS patients that highly express these markers.
View details for DOI 10.1097/COC.0b013e3181d26d5e
View details for PubMedID 23781555
-
Nearly-Isotonic Regression
TECHNOMETRICS
2011; 53 (1): 54-61
View details for DOI 10.1198/TECH.2010.10111
View details for Web of Science ID 000287436200005
-
Bayesian gene set analysis for identifying significant biological pathways
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES C-APPLIED STATISTICS
2011; 60: 541-557
Abstract
We propose a hierarchical Bayesian model for analyzing gene expression data to identify pathways differentiating between two biological states (e.g., cancer vs. non-cancer and mutant vs. normal). Finding significant pathways can improve our understanding of biological processes. When the biological process of interest is related to a specific disease, eliciting a better understanding of the underlying pathways can lead to designing a more effective treatment. We apply our method to data obtained by interrogating the mutational status of p53 in 50 cancer cell lines (33 mutated and 17 normal). We identify several significant pathways with strong biological connections. We show that our approach provides a natural framework for incorporating prior biological information, and it has the best overall performance in terms of correctly identifying significant pathways compared to several alternative methods.
View details for DOI 10.1111/j.1467-9876.2011.00765.x
View details for Web of Science ID 000293235800004
View details for PubMedCentralID PMC3156489
-
Supervised multidimensional scaling for visualization, classification, and bipartite ranking
COMPUTATIONAL STATISTICS & DATA ANALYSIS
2011; 55 (1): 789-801
View details for DOI 10.1016/j.csda.2010.07.001
View details for Web of Science ID 000283017900067
-
A statistician plays darts
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A-STATISTICS IN SOCIETY
2011; 174: 213-226
View details for Web of Science ID 000285969600013
-
Adaptive index models for marker-based risk stratification
BIOSTATISTICS
2011; 12 (1): 68-86
Abstract
We use the term "index predictor" to denote a score that consists of K binary rules such as "age > 60" or "blood pressure > 120 mm Hg." The index predictor is the sum of these binary scores, yielding a value from 0 to K. Such indices are often used in clinical studies to stratify population risk; they are usually derived from subject area considerations. In this paper, we propose a fast data-driven procedure for automatically constructing such indices for linear, logistic, and Cox regression models. We also extend the procedure to create indices for detecting treatment-marker interactions. The methods are illustrated on a study with protein biomarkers as well as a large microarray gene expression study.
View details for DOI 10.1093/biostatistics/kxq047
View details for Web of Science ID 000285625800005
View details for PubMedID 20663850
View details for PubMedCentralID PMC3006126
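The index predictor defined in the abstract above is simple enough to write down directly. The sketch below only evaluates a fixed set of rules; it does not attempt the data-driven construction the paper proposes. The two rules are the examples quoted in the abstract, while the dictionary field names `age` and `blood_pressure` and the function name are hypothetical.

```python
# Each rule is a (label, predicate) pair; the field names are illustrative.
RULES = [
    ("age > 60", lambda patient: patient["age"] > 60),
    ("blood pressure > 120 mm Hg", lambda patient: patient["blood_pressure"] > 120),
]


def index_score(patient, rules=RULES):
    """Sum of the binary rule outcomes: an integer from 0 to K = len(rules)."""
    return sum(1 for _, rule in rules if rule(patient))
```

A patient aged 65 with a blood pressure of 110 satisfies only the first rule and so scores 1 out of a possible 2.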
-
Regression shrinkage and selection via the lasso: a retrospective
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2011; 73: 273-282
View details for Web of Science ID 000290575300001
-
Penalized classification using Fisher's linear discriminant
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2011; 73: 753-772
View details for DOI 10.1111/j.1467-9868.2011.00783.x
View details for Web of Science ID 000295969700006
-
In Situ Vaccination with TLR9 Agonist Combined with Local Radiation In Mycosis Fungoides: Analysis of Phase I/II Study
52nd Annual Meeting and Exposition of the American-Society-of-Hematology (ASH)
AMER SOC HEMATOLOGY. 2010: 130–30
View details for Web of Science ID 000289662200287
-
Prediction of Survival In Diffuse Large B-Cell Lymphoma Based On the Expression of Two Genes Reflecting Tumor and Microenvironment
52nd Annual Meeting and Exposition of the American-Society-of-Hematology (ASH)
AMER SOC HEMATOLOGY. 2010: 836–37
View details for Web of Science ID 000289662202229
-
In Situ Vaccination With a TLR9 Agonist Induces Systemic Lymphoma Regression: A Phase I/II Study
JOURNAL OF CLINICAL ONCOLOGY
2010; 28 (28): 4324-4332
Abstract
Combining tumor antigens with an immunostimulant can induce the immune system to specifically eliminate cancer cells. Generally, this combination is accomplished in an ex vivo, customized manner. In a preclinical lymphoma model, intratumoral injection of a Toll-like receptor 9 (TLR9) agonist induced systemic antitumor immunity and cured large, disseminated tumors. We treated 15 patients with low-grade B-cell lymphoma using low-dose radiotherapy to a single tumor site and, at that same site, injected the C-G enriched, synthetic oligodeoxynucleotide (also referred to as CpG) TLR9 agonist PF-3512676. Clinical responses were assessed at distant, untreated tumor sites. Immune responses were evaluated by measuring T-cell activation after in vitro restimulation with autologous tumor cells. This in situ vaccination maneuver was well-tolerated with only grade 1 to 2 local or systemic reactions and no treatment-limiting adverse events. One patient had a complete clinical response, three others had partial responses, and two patients had stable but continually regressing disease for periods significantly longer than that achieved with prior therapies. Vaccination induced tumor-reactive memory CD8 T cells. Some patients' tumors were able to induce a suppressive, regulatory phenotype in autologous T cells in vitro; these patients tended to have a shorter time to disease progression. One clinically responding patient received a second course of vaccination after relapse resulting in a second, more rapid clinical response. In situ tumor vaccination with a TLR9 agonist induces systemic antilymphoma clinical responses. This maneuver is clinically feasible and does not require the production of a customized vaccine product.
View details for DOI 10.1200/JCO.2010.28.9793
View details for Web of Science ID 000282272700032
View details for PubMedID 20697067
View details for PubMedCentralID PMC2954133
-
Spectral Regularization Algorithms for Learning Large Incomplete Matrices
JOURNAL OF MACHINE LEARNING RESEARCH
2010; 11: 2287-2322
Abstract
We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm Soft-Impute iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices: for example it can obtain a rank-80 approximation of a 10^6 × 10^6 incomplete matrix with 10^5 observed entries in 2.5 hours, and can fit a rank 40 approximation to the full Netflix training set in 6.6 hours. Our methods show very good performance both in training and test error when compared to other competitive state-of-the-art techniques.
View details for Web of Science ID 000282523300010
View details for PubMedCentralID PMC3087301
-
Analysis of factorial time-course microarrays with application to a clinical study of burn injury.
Proceedings of the National Academy of Sciences of the United States of America
2010; 107 (22): 9923-9928
Abstract
Time-course microarray experiments are capable of capturing dynamic gene expression profiles. It is important to study how these dynamic profiles depend on the multiple factors that characterize the experimental condition under which the time course is observed. Analytic methods are needed to simultaneously handle the time course and factorial structure in the data. We developed a method to evaluate factor effects by pooling information across the time course while accounting for multiple testing and nonnormality of the microarray data. The method effectively extracts gene-specific response features and models their dependency on the experimental factors. Both longitudinal and cross-sectional time-course data can be handled by our approach. The method was used to analyze the impact of age on the temporal gene response to burn injury in a large-scale clinical study. Our analysis reveals that 21% of the genes responsive to burn are age-specific, among which expressions of mitochondria and immunoglobulin genes are differentially perturbed in pediatric and adult patients by burn injury. These new findings in the body's response to burn injury between children and adults support further investigations of therapeutic options targeting specific age groups. The methodology proposed here has been implemented in R package "TANOVA" and submitted to the Comprehensive R Archive Network at http://www.r-project.org/. It is also available for download at http://gluegrant1.stanford.edu/TANOVA/.
View details for DOI 10.1073/pnas.1002757107
View details for PubMedID 20479259
View details for PubMedCentralID PMC2890487
-
TRANSPOSABLE REGULARIZED COVARIANCE MODELS WITH AN APPLICATION TO MISSING DATA IMPUTATION.
The annals of applied statistics
2010; 4 (2): 764-790
Abstract
Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable, meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal, in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so-called transposable regularized covariance models allow for maximum likelihood estimation of the mean and non-singular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility.
View details for DOI 10.1214/09-AOAS314
View details for PubMedID 26877823
View details for PubMedCentralID PMC4751046
-
A Framework for Feature Selection in Clustering
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2010; 105 (490): 713-726
View details for DOI 10.1198/jasa.2010.tm09415
View details for Web of Science ID 000280216700023
-
TRANSPOSABLE REGULARIZED COVARIANCE MODELS WITH AN APPLICATION TO MISSING DATA IMPUTATION
ANNALS OF APPLIED STATISTICS
2010; 4 (2): 764-790
View details for DOI 10.1214/09-AOAS314
View details for Web of Science ID 000283528500011
-
Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls
BMC BIOLOGY
2010; 8
Abstract
Ultra-high throughput sequencing technologies provide opportunities both for discovery of novel molecular species and for detailed comparisons of gene expression patterns. Small RNA populations are particularly well suited to this analysis, as many different small RNAs can be completely sequenced in a single instrument run. We prepared small RNA libraries from 29 tumour/normal pairs of human cervical tissue samples. Analysis of the resulting sequences (42 million in total) defined 64 new human microRNA (miRNA) genes. Both arms of the hairpin precursor were observed in twenty-three of the newly identified miRNA candidates. We tested several computational approaches for the analysis of class differences between high throughput sequencing datasets and describe a novel application of a log linear model that has provided the most effective analysis for this data. This method resulted in the identification of 67 miRNAs that were differentially-expressed between the tumour and normal samples at a false discovery rate less than 0.001. This approach can potentially be applied to any kind of RNA sequencing data for analysing differential sequence representation between biological sample sets.
View details for DOI 10.1186/1741-7007-8-58
View details for Web of Science ID 000279780700001
View details for PubMedID 20459774
View details for PubMedCentralID PMC2880020
-
Cell type-specific gene expression differences in complex tissues
NATURE METHODS
2010; 7 (4): 287-289
Abstract
We describe cell type-specific significance analysis of microarrays (csSAM) for analyzing differential gene expression for each cell type in a biological sample from microarray data and relative cell-type frequencies. First, we validated csSAM with predesigned mixtures and then applied it to whole-blood gene expression datasets from stable post-transplant kidney transplant recipients and those experiencing acute transplant rejection, which revealed hundreds of differentially expressed genes that were otherwise undetectable.
View details for DOI 10.1038/NMETH.1439
View details for Web of Science ID 000276150600017
View details for PubMedID 20208531
-
Novel Cell-Type Specific Deconvolution of Whole-Blood Gene Expression Profiles in Renal Acute Rejection
10th American Transplant Congress
WILEY-BLACKWELL. 2010: 294–294
View details for Web of Science ID 000275921702289
-
C-C Chemokine Receptor 1 Expression in Human Hematolymphoid Neoplasia
AMERICAN JOURNAL OF CLINICAL PATHOLOGY
2010; 133 (3): 473-483
Abstract
Chemokine receptor 1 (CCR1) is a G protein-coupled receptor that binds to members of the C-C chemokine family. Recently, CCL3 (MIP-1alpha), a high-affinity CCR1 ligand, was identified as part of a model that independently predicts survival in patients with diffuse large B-cell lymphoma (DLBCL). However, the role of chemokine signaling in the pathogenesis of human lymphomas is unclear. In normal human hematopoietic tissues, we found CCR1 expression in intraepithelial B cells of human tonsil and granulocytic/monocytic cells in the bone marrow. Immunohistochemical analysis of 944 cases of hematolymphoid neoplasia identified CCR1 expression in a subset of B- and T-cell lymphomas, plasma cell myeloma, acute myeloid leukemia, and classical Hodgkin lymphoma. CCR1 expression correlated with the non-germinal center subtype of DLBCL but did not predict overall survival in follicular lymphoma. These data suggest that CCR1 may be useful for lymphoma classification and support a role for chemokine signaling in the pathogenesis of hematolymphoid neoplasia.
View details for DOI 10.1309/AJCP1TA3FLOQTMHF
View details for Web of Science ID 000274687800016
View details for PubMedID 20154287
-
Spectral Regularization Algorithms for Learning Large Incomplete Matrices.
Journal of machine learning research : JMLR
2010; 11: 2287-2322
View details for PubMedID 21552465
View details for PubMedCentralID PMC3087301
-
Discovery of molecular subtypes in leiomyosarcoma through integrative molecular profiling
ONCOGENE
2010; 29 (6): 845-854
Abstract
Leiomyosarcoma (LMS) is a soft tissue tumor with a significant degree of morphologic and molecular heterogeneity. We used integrative molecular profiling to discover and characterize molecular subtypes of LMS. Gene expression profiling was performed on 51 LMS samples. Unsupervised clustering showed three reproducible LMS clusters. Array comparative genomic hybridization (aCGH) was performed on 20 LMS samples and showed that the molecular subtypes defined by gene expression showed distinct genomic changes. Tumors from the 'muscle-enriched' cluster showed significantly increased copy number changes (P=0.04). A majority of the muscle-enriched cases showed loss at 16q24, which contains Fanconi anemia, complementation group A, known to have an important role in DNA repair, and loss at 1p36, which contains PRDM16, of which loss promotes muscle differentiation. Immunohistochemistry (IHC) was performed on LMS tissue microarrays (n=377) for five markers with high levels of messenger RNA in the muscle-enriched cluster (ACTG2, CASQ2, SLMAP, CFL2 and MYLK) and showed significantly correlated expression of the five proteins (all pairwise P<0.005). Expression of the five markers was associated with improved disease-specific survival in a multivariate Cox regression analysis (P<0.04). In this analysis that combined gene expression profiling, aCGH and IHC, we characterized distinct molecular LMS subtypes, provided insight into their pathogenesis, and identified prognostic biomarkers.
View details for DOI 10.1038/onc.2009.381
View details for Web of Science ID 000274397800007
View details for PubMedID 19901961
View details for PubMedCentralID PMC2820592
-
CD81 protein is expressed at high levels in normal germinal center B cells and in subtypes of human lymphomas
HUMAN PATHOLOGY
2010; 41 (2): 271-280
Abstract
CD81 is a tetraspanin cell surface protein that regulates CD19 expression in B lymphocytes and enables hepatitis C virus infection of human cells. Immunohistologic analysis in normal hematopoietic tissue showed strong staining for CD81 in normal germinal center B cells, a cell type in which its increased expression has not been previously recognized. High-dimensional flow cytometry analysis of normal hematopoietic tissue confirmed that among B- and T-cell subsets, germinal center B cells showed the highest level of CD81 expression. In more than 800 neoplastic tissue samples, its expression was also found in most non-Hodgkin lymphomas. Staining for CD81 was rarely seen in multiple myeloma, Hodgkin lymphoma, or myeloid leukemia. In hierarchical cluster analysis of diffuse large B-cell lymphoma, staining for CD81 was most similar to other germinal center B cell-associated markers, particularly LMO2. By flow cytometry, CD81 was expressed in diffuse large B-cell lymphoma cells independent of the presence or absence of CD10, another germinal center B-cell marker. The detection of CD81 in routine biopsy samples and its differential expression in lymphoma subtypes, particularly diffuse large B-cell lymphoma, warrant further study to assess CD81 expression and its role in the risk stratification of patients with diffuse large B-cell lymphoma.
View details for DOI 10.1016/j.humpath.2009.07.022
View details for Web of Science ID 000276493600015
View details for PubMedID 20004001
View details for PubMedCentralID PMC2813949
-
DR-Integrator: a new analytic tool for integrating DNA copy number and gene expression data
BIOINFORMATICS
2010; 26 (3): 414-416
Abstract
DNA copy number alterations (CNA) frequently underlie gene expression changes by increasing or decreasing gene dosage. However, only a subset of genes with altered dosage exhibit concordant changes in gene expression. This subset is likely to be enriched for oncogenes and tumor suppressor genes, and can be identified by integrating these two layers of genome-scale data. We introduce DNA/RNA-Integrator (DR-Integrator), a statistical software tool to perform integrative analyses on paired DNA copy number and gene expression data. DR-Integrator identifies genes with significant correlations between DNA copy number and gene expression, and implements a supervised analysis that captures genes with significant alterations in both DNA copy number and gene expression between two sample classes. DR-Integrator is freely available for non-commercial use from the Pollack Lab at http://pollacklab.stanford.edu/ and can be downloaded as a plug-in application to Microsoft Excel and as a package for the R statistical computing environment. The R package is available under the name 'DRI' at http://cran.r-project.org/. An example analysis using DR-Integrator is included as supplemental material. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btp702
View details for Web of Science ID 000274342800021
View details for PubMedID 20031972
View details for PubMedCentralID PMC2815664
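The correlation step described in this abstract can be illustrated with a minimal sketch. This is not DR-Integrator's code or API: the function names, the dictionary layout, and the gene labels are hypothetical, and plain Pearson correlation is assumed as the per-gene measure, with no significance testing.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: treat the correlation as zero
    return sxy / math.sqrt(sxx * syy)

def rank_genes_by_dna_rna_correlation(copy_number, expression):
    """Rank genes by the correlation between paired copy number and expression.

    Both arguments map a gene name to per-sample values, with samples in the
    same order (hypothetical layout, chosen only for this illustration).
    """
    scores = {g: pearson(copy_number[g], expression[g])
              for g in copy_number if g in expression}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy paired data for four samples (made-up values).
copy_number = {"GENE_A": [2, 3, 4, 5], "GENE_B": [2, 2, 3, 3]}
expression = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [5.0, 1.0, 4.0, 2.0]}
ranked = rank_genes_by_dna_rna_correlation(copy_number, expression)
```

In this toy data GENE_A has expression that tracks its copy number, so it ranks first. A real analysis would also need a significance assessment, for example by permuting sample labels.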
-
Regularization Paths for Generalized Linear Models via Coordinate Descent
JOURNAL OF STATISTICAL SOFTWARE
2010; 33 (1): 1-22
Abstract
We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include $\ell_1$ (the lasso), $\ell_2$ (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
View details for Web of Science ID 000275203200001
View details for PubMedCentralID PMC2929880
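A minimal sketch of cyclical coordinate descent for the linear-regression case is given here for illustration. It is not the authors' software: the function names are hypothetical, and the sketch assumes a centered response with no intercept and the objective (1/(2n))·||y − Xβ||² + λ·(α·||β||₁ + ((1 − α)/2)·||β||₂²). Each coordinate update is a soft-thresholded partial-residual correlation, and the path is computed from the largest λ to the smallest with warm starts.

```python
def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def elastic_net_cd(X, y, lam, alpha=1.0, beta_init=None,
                   max_sweeps=1000, tol=1e-12):
    """Cyclical coordinate descent for the elastic net (no intercept)."""
    n, p = len(X), len(X[0])
    beta = list(beta_init) if beta_init is not None else [0.0] * p
    # Residual y - X beta for the starting coefficients.
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    # Mean square of each column; equals 1 for standardized features.
    col_ms = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(p):
            denom = col_ms[j] + lam * (1.0 - alpha)
            if denom == 0.0:
                continue  # all-zero column with a pure lasso penalty
            # Correlation of feature j with the partial residual.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ms[j] * beta[j]
            new_bj = soft_threshold(rho, lam * alpha) / denom
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def elastic_net_path(X, y, lambdas, alpha=1.0):
    """Solve along a decreasing lambda sequence, warm-starting each fit."""
    path, beta = [], None
    for lam in sorted(lambdas, reverse=True):
        beta = elastic_net_cd(X, y, lam, alpha, beta_init=beta)
        path.append((lam, beta))
    return path

# Toy orthogonal design (made-up numbers).
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [2.0, 0.1, -2.0, -0.1]
lasso_fit = elastic_net_cd(X, y, lam=0.1, alpha=1.0)
ridge_fit = elastic_net_cd(X, y, lam=0.5, alpha=0.0)
path = elastic_net_path(X, y, [0.1, 1.0, 0.5])
```

With the lasso penalty the weak second feature is set exactly to zero, while the ridge penalty only shrinks it. Warm starts are what make a whole path cheap, because each solution starts close to the next one.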
-
Survival analysis with high-dimensional covariates
STATISTICAL METHODS IN MEDICAL RESEARCH
2010; 19 (1): 29-51
Abstract
In recent years, breakthroughs in biomedical technology have led to a wealth of data in which the number of features (for instance, genes on which expression measurements are available) exceeds the number of observations (e.g. patients). Sometimes survival outcomes are also available for those same observations. In this case, one might be interested in (a) identifying features that are associated with survival (in a univariate sense), and (b) developing a multivariate model for the relationship between the features and survival that can be used to predict survival in a new observation. Due to the high dimensionality of this data, most classical statistical methods for survival analysis cannot be applied directly. Here, we review a number of methods from the literature that address these two problems.
View details for DOI 10.1177/0962280209105024
View details for Web of Science ID 000274317100003
View details for PubMedID 19654171
-
3'-End Sequencing for Expression Quantification (3SEQ) from Archival Tumor Samples
PLOS ONE
2010; 5 (1)
Abstract
Gene expression microarrays are the most widely used technique for genome-wide expression profiling. However, microarrays do not perform well on formalin fixed paraffin embedded tissue (FFPET). Consequently, microarrays cannot be effectively utilized to perform gene expression profiling on the vast majority of archival tumor samples. To address this limitation of gene expression microarrays, we designed a novel procedure (3'-end sequencing for expression quantification (3SEQ)) for gene expression profiling from FFPET using next-generation sequencing. We performed gene expression profiling by 3SEQ and microarray on both frozen tissue and FFPET from two soft tissue tumors (desmoid type fibromatosis (DTF) and solitary fibrous tumor (SFT)) (total n = 23 samples, which were each profiled by at least one of the four platform-tissue preparation combinations). Analysis of 3SEQ data revealed many genes differentially expressed between the tumor types (FDR<0.01) on both the frozen tissue (approximately 9.6K genes) and FFPET (approximately 8.1K genes). Analysis of microarray data from frozen tissue revealed fewer differentially expressed genes (approximately 4.64K), and analysis of microarray data on FFPET revealed very few (69) differentially expressed genes. Functional gene set analysis of 3SEQ data from both frozen tissue and FFPET identified biological pathways known to be important in DTF and SFT pathogenesis and suggested several additional candidate oncogenic pathways in these tumors. These findings demonstrate that 3SEQ is an effective technique for gene expression profiling from archival tumor samples and may facilitate significant advances in translational cancer research.
View details for DOI 10.1371/journal.pone.0008768
View details for PubMedID 20098735
-
Predicting Patient Survival from Longitudinal Gene Expression
STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY
2010; 9 (1)
Abstract
Characterizing dynamic gene expression pattern and predicting patient outcome is now significant and will be of more interest in the future with large scale clinical investigation of microarrays. However, there is currently no method that has been developed for prediction of patient outcome using longitudinal gene expression, where gene expression of patients is being monitored across time. Here, we propose a novel prediction approach for patient survival time that makes use of time course structure of gene expression. This method is applied to a burn study. The genes involved in the final predictors are enriched in the inflammatory response and immune system related pathways. Moreover, our method is consistently better than prediction methods using individual time point gene expression or simply pooling gene expression from each time point.
View details for DOI 10.2202/1544-6115.1617
View details for Web of Science ID 000284905500002
View details for PubMedID 21126232
View details for PubMedCentralID PMC3004784
-
Extracting Cell-type-specific Gene Expression Differences from Complex Tissues
10th Annual Meeting of the Federation-of-Clinical-Immunology-Societies
ACADEMIC PRESS INC ELSEVIER SCIENCE. 2010: S10–S10
View details for DOI 10.1016/j.clim.2010.03.037
View details for Web of Science ID 000277953700021
-
Lymphoma cell VEGFR2 expression detected by immunohistochemistry predicts poor overall survival in diffuse large B cell lymphoma treated with immunochemotherapy (R-CHOP)
BRITISH JOURNAL OF HAEMATOLOGY
2010; 148 (2): 235-244
Abstract
Diffuse large B cell lymphoma (DLBCL) is clinically and biologically heterogeneous. In most cases of DLBCL, lymphoma cells co-express vascular endothelial growth factor (VEGF) and its receptors VEGFR1 and VEGFR2, suggesting autocrine in addition to angiogenic effects. We enumerated microvessel density and scored lymphoma cell expression of VEGF, VEGFR1, VEGFR2 and phosphorylated VEGFR2 in 162 de novo DLBCL patients treated with R-CHOP (rituximab, cyclophosphamide, vincristine, doxorubicin and prednisone)-like regimens. VEGFR2 expression correlated with shorter overall survival (OS) independent of International Prognostic Index (IPI) (P = 0.0028). Phosphorylated VEGFR2 (detected in 13% of cases) correlated with shorter progression-free survival (PFS, P = 0.044) and trended toward shorter OS on univariate analysis. VEGFR1 was not predictive of survival on univariate analysis, but it did correlate with better OS on multivariate analysis with VEGF, VEGFR2 and IPI (P = 0.036); in patients with weak VEGFR2, lack of VEGFR1 coexpression was significantly correlated with poor OS independent of IPI (P = 0.01). These results are concordant with our prior finding of an association of VEGFR1 with longer OS in DLBCL treated with chemotherapy alone. We postulate that VEGFR1 may oppose autocrine VEGFR2 signalling in DLBCL by competing for VEGF binding. In contrast to our prior results with chemotherapy alone, microvessel density was not prognostic of PFS or OS with R-CHOP-like therapy.
View details for DOI 10.1111/j.1365-2141.2009.07942.x
View details for PubMedID 19821819
-
Local false discovery rate facilitates comparison of different microarray experiments
NUCLEIC ACIDS RESEARCH
2009; 37 (22): 7483-7497
Abstract
The local false discovery rate (LFDR) estimates the probability of falsely identifying specific genes with changes in expression. In computer simulations, LFDR <10% successfully identified genes with changes in expression, while LFDR >90% identified genes without changes. We used LFDR to compare different microarray experiments quantitatively: (i) Venn diagrams of genes with and without changes in expression, (ii) scatter plots of the genes, (iii) correlation coefficients in the scatter plots and (iv) distributions of gene function. To illustrate, we compared three methods for pre-processing microarray data. Correlations between methods were high (r = 0.84-0.92). However, responses were often different in magnitude, and sometimes discordant, even though the methods used the same raw data. LFDR complements functional assessments like gene set enrichment analysis. To illustrate, we compared responses to ultraviolet radiation (UV), ionizing radiation (IR) and tobacco smoke. Compared to unresponsive genes, genes responsive to both UV and IR were enriched for cell cycle, mitosis, and DNA repair functions. Genes responsive to UV but not IR were depleted for cell adhesion functions. Genes responsive to tobacco smoke were enriched for detoxification functions. Thus, LFDR reveals differences and similarities among experiments.
View details for DOI 10.1093/nar/gkp813
View details for PubMedID 19825981
-
Relationship of differential gene expression profiles in CD34(+) myelodysplastic syndrome marrow cells to disease subtype and progression
BLOOD
2009; 114 (23): 4847-4858
Abstract
Microarray analysis with 40 000 cDNA gene chip arrays determined differential gene expression profiles (GEPs) in CD34(+) marrow cells from myelodysplastic syndrome (MDS) patients compared with healthy persons. Using focused bioinformatics analyses, we found 1175 genes significantly differentially expressed by MDS versus normal, requiring a minimum of 39 genes to separately classify these patients. Major GEP differences were demonstrated between healthy and MDS patients and between several MDS subgroups: (1) those whose disease remained stable and those who subsequently transformed (tMDS) to acute myeloid leukemia; (2) between del(5q) and other MDS patients. A 6-gene "poor risk" signature was defined, which was associated with acute myeloid leukemia transformation and provided additive prognostic information for International Prognostic Scoring System Intermediate-1 patients. Overexpression of genes generating ribosomal proteins and for other signaling pathways was demonstrated in the tMDS patients. Comparison of del(5q) with the remaining MDS patients showed 1924 differentially expressed genes, with underexpression of 1014 genes, 11 of which were within the 5q31-32 commonly deleted region. These data demonstrated (1) GEPs distinguishing MDS patients from healthy and between those with differing clinical outcomes (tMDS vs those whose disease remained stable) and cytogenetics [eg, del(5q)]; and (2) molecular criteria refining prognostic categorization and associated biologic processes in MDS.
View details for DOI 10.1182/blood-2009-08-236422
View details for PubMedID 19801443
-
Disease signatures are robust across tissues and experiments
MOLECULAR SYSTEMS BIOLOGY
2009; 5
Abstract
Meta-analyses combining gene expression microarray experiments offer new insights into the molecular pathophysiology of disease not evident from individual experiments. Although the established technical reproducibility of microarrays serves as a basis for meta-analysis, pathophysiological reproducibility across experiments is not well established. In this study, we carried out a large-scale analysis of disease-associated experiments obtained from NCBI GEO, and evaluated their concordance across a broad range of diseases and tissue types. On evaluating 429 experiments, representing 238 diseases and 122 tissues from 8435 microarrays, we find evidence for a general, pathophysiological concordance between experiments measuring the same disease condition. Furthermore, we find that the molecular signature of disease across tissues is overall more prominent than the signature of tissue expression across diseases. The results offer new insight into the quality of public microarray data using pathophysiological metrics, and support new directions in meta-analysis that include characterization of the commonalities of disease irrespective of tissue, as well as the creation of multi-tissue systems models of disease pathology using public data.
View details for DOI 10.1038/msb.2009.66
View details for Web of Science ID 000270456400006
View details for PubMedID 19756046
View details for PubMedCentralID PMC2758720
-
A Network Model of a Cooperative Genetic Landscape in Brain Tumors
JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION
2009; 302 (3): 261-275
Abstract
Gliomas, particularly glioblastomas, are among the deadliest of human tumors. Gliomas emerge through the accumulation of recurrent chromosomal alterations, some of which target yet-to-be-discovered cancer genes. A persistent question concerns the biological basis for the coselection of these alterations during gliomagenesis. To describe a network model of a cooperative genetic landscape in gliomas and to evaluate its clinical relevance. Multidimensional genomic profiles and clinical profiles of 501 patients with gliomas (45 tumors in an initial discovery set collected between 2001 and 2004 and 456 tumors in validation sets made public between 2006 and 2008) from multiple academic centers in the United States and The Cancer Genome Atlas Pilot Project (TCGA). Identification of genes with coincident genetic alterations, correlated gene dosage and gene expression, and multiple functional interactions; association between those genes and patient survival. Gliomas select for a nonrandom genetic landscape (a consistent pattern of chromosomal alterations) that involves altered regions ("territories") on chromosomes 1p, 7, 8q, 9p, 10, 12q, 13q, 19q, 20, and 22q (false-discovery rate-corrected P<.05). A network model shows that these territories harbor genes with putative synergistic, tumor-promoting relationships. The coalteration of the most interactive of these genes in glioblastoma is associated with unfavorable patient survival. A multigene risk scoring model based on 7 landscape genes (POLD2, CYCS, MYC, AKR1C3, YME1L1, ANXA7, and PDCD4) is associated with the duration of overall survival in 189 glioblastoma samples from TCGA (global log-rank P = .02 comparing 3 survival curves for patients with 0-2, 3-4, and 5-7 dosage-altered genes).
Groups of patients with 0 to 2 (low-risk group) and 5 to 7 (high-risk group) dosage-altered genes experienced 49.24 and 79.56 deaths per 100 person-years (hazard ratio [HR], 1.63; 95% confidence interval [CI], 1.10-2.40; Cox regression model P = .02), respectively. These associations with survival are validated using gene expression data in 3 independent glioma studies, comprising 76 (global log-rank P = .003; 47.89 vs 15.13 deaths per 100 person-years for high risk vs low risk; Cox model HR, 3.04; 95% CI, 1.49-6.20; P = .002) and 70 (global log-rank P = .008; 83.43 vs 16.14 deaths per 100 person-years for high risk vs low risk; HR, 3.86; 95% CI, 1.59-9.35; P = .003) high-grade gliomas and 191 glioblastomas (global log-rank P = .002; 83.23 vs 34.16 deaths per 100 person-years for high risk vs low risk; HR, 2.27; 95% CI, 1.44-3.58; P<.001). The alteration of multiple networking genes by recurrent chromosomal aberrations in gliomas deregulates critical signaling pathways through multiple, cooperative mechanisms. These mutations, which are likely due to nonrandom selection of a distinct genetic landscape during gliomagenesis, are associated with patient prognosis.
View details for Web of Science ID 000267948100020
View details for PubMedID 19602686
-
A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis
BIOSTATISTICS
2009; 10 (3): 515-534
Abstract
We present a penalized matrix decomposition (PMD), a new framework for computing a rank-$K$ approximation for a matrix. We approximate the matrix $X$ as $\hat{X} = \sum_{k=1}^{K} d_k u_k v_k^T$, where $d_k$, $u_k$, and $v_k$ minimize the squared Frobenius norm of $X - \hat{X}$, subject to penalties on $u_k$ and $v_k$. This results in a regularized version of the singular value decomposition. Of particular interest is the use of $L_1$-penalties on $u_k$ and $v_k$, which yields a decomposition of $X$ using sparse vectors. We show that when the PMD is applied using an $L_1$-penalty on $v_k$ but not on $u_k$, a method for sparse principal components results. In fact, this yields an efficient algorithm for the "SCoTLASS" proposal (Jolliffe and others 2003) for obtaining sparse principal components. This method is demonstrated on a publicly available gene expression data set. We also establish connections between the SCoTLASS method for sparse principal component analysis and the method of Zou and others (2006). In addition, we show that when the PMD is applied to a cross-products matrix, it results in a method for penalized canonical correlation analysis (CCA). We apply this penalized CCA method to simulated data and to a genomic data set consisting of gene expression and DNA copy number measurements on the same set of samples.
View details for DOI 10.1093/biostatistics/kxp008
View details for Web of Science ID 000267213700010
View details for PubMedID 19377034
View details for PubMedCentralID PMC2697346
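The rank-1 case with L1 constraints on both vectors can be sketched as an alternating update, shown here for illustration only. This is not the authors' implementation: the function names are hypothetical, and the sketch assumes the constraints take the form ||u||₂ ≤ 1, ||u||₁ ≤ c1 and ||v||₂ ≤ 1, ||v||₁ ≤ c2, with each bound at least 1. Each update soft-thresholds Xv (or Xᵀu) and rescales to unit length, with the threshold found by bisection.

```python
import math

def soft_threshold(z, delta):
    """Shrink z toward zero by delta."""
    if z > delta:
        return z - delta
    if z < -delta:
        return z + delta
    return 0.0

def l1_bounded_unit_vector(a, c):
    """Return S(a, delta) / ||S(a, delta)||_2 for the smallest delta >= 0
    that brings the L1 norm of the result down to at most c (assumes c >= 1)."""
    def unit(delta):
        s = [soft_threshold(x, delta) for x in a]
        norm = math.sqrt(sum(x * x for x in s))
        return [x / norm for x in s] if norm > 0 else s

    w = unit(0.0)
    if sum(abs(x) for x in w) <= c:
        return w  # the constraint is already satisfied without shrinkage
    lo, hi = 0.0, max(abs(x) for x in a)
    for _ in range(60):  # bisection on the threshold
        mid = (lo + hi) / 2.0
        if sum(abs(x) for x in unit(mid)) > c:
            lo = mid
        else:
            hi = mid
    return unit(hi)

def pmd_rank1(X, c1, c2, n_iter=100):
    """Alternate the u and v updates, then return (d, u, v) with d = u^T X v."""
    n, p = len(X), len(X[0])
    v = [1.0 / math.sqrt(p)] * p
    u = [0.0] * n
    for _ in range(n_iter):
        Xv = [sum(X[i][j] * v[j] for j in range(p)) for i in range(n)]
        u = l1_bounded_unit_vector(Xv, c1)
        Xtu = [sum(X[i][j] * u[i] for i in range(n)) for j in range(p)]
        v = l1_bounded_unit_vector(Xtu, c2)
    d = sum(u[i] * X[i][j] * v[j] for i in range(n) for j in range(p))
    return d, u, v

# Loose bounds: the update reduces to the ordinary power method for the SVD.
d_svd, u_svd, v_svd = pmd_rank1([[3.0, 0.0], [0.0, 1.0]], c1=2.0, c2=2.0)
# Tight bound on v: the smallest loading is driven exactly to zero.
d_sp, u_sp, v_sp = pmd_rank1([[3.0, 1.0, 0.1]], c1=1.5, c2=1.2)
```

Further components would be obtained by applying the same routine to the residual matrix X − d·u·vᵀ. Entries of equal magnitude can make some L1 bounds unattainable, which this sketch does not handle.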
-
Alteration of Gene Expression Signatures of Cortical Differentiation and Wound Response in Lethal Clear Cell Renal Cell Carcinomas
PLOS ONE
2009; 4 (6)
Abstract
Clear cell renal cell carcinoma (ccRCC) is the most common malignancy of the adult kidney and displays heterogeneity in clinical outcomes. Through comprehensive gene expression profiling, we have identified previously a set of transcripts that predict survival following nephrectomy independent of tumor stage, grade, and performance status. These transcripts, designated as the SPC (supervised principal components) gene set, show no apparent biological or genetic features that provide insight into renal carcinogenesis or tumor progression. We explored the relationship of this gene list to a set of genes expressed in different anatomical segments of the normal kidney including the cortex (cortex gene set) and the glomerulus (glomerulus gene set), and a gene set expressed after serum stimulation of quiescent fibroblasts (the core serum response or CSR gene set). Interestingly, the normal cortex, glomerulus (part of the normal renal cortex), and CSR gene sets captured more than 1/5 of the genes in the highly prognostic SPC gene set. Based on gene expression patterns alone, the SPC gene set could be used to sort samples from normal adult kidneys by the anatomical regions from which they were dissected. Tumors whose gene expression profiles most resembled the normal renal cortex or glomerulus showed better survival than those that did not, and those with expression features more similar to CSR showed poorer survival. While the cortex, glomerulus, and CSR signatures predicted survival independent of traditional clinical parameters, they were not independent of the SPC gene list. Our findings suggest that critical biological features of lethal ccRCC include loss of normal cortical differentiation and activation of programs associated with wound healing.
View details for DOI 10.1371/journal.pone.0006039
View details for Web of Science ID 000267356900003
View details for PubMedID 19557179
View details for PubMedCentralID PMC2698218
-
Anti-idiotype antibody response after vaccination correlates with better overall survival in follicular lymphoma
BLOOD
2009; 113 (23): 5743-5746
Abstract
Previous studies demonstrated that vaccination-induced tumor-specific immune response is associated with superior clinical outcome in patients with follicular lymphoma. Here, we investigated whether this positive correlation extends to overall survival (OS). We analyzed 91 untreated patients who received CVP chemotherapy (cyclophosphamide, vincristine, and prednisone) followed by idiotype vaccination. Idiotype proteins were produced either by the hybridoma method or by expression of recombinant idiotype-encoding sequences in mammalian or plant-based expression systems. We found that achieving a complete response/complete response unconfirmed (CR/CRu) to CVP and making an anti-idiotype antibody are 2 independent factors that each correlated with longer OS at 10 years (89% vs 68% with or without a CR/CRu, P = .024; 90% vs 69% with or without tumor-specific antibody production; P = .027). In the subset of patients who received hybridoma-generated vaccines, we found that anti-idiotype production was even more highly associated with superior OS (P < .002); this was the case even in patients with a partial response (PR) to CVP (P < .001).
View details for DOI 10.1182/blood-2009-01-201988
View details for Web of Science ID 000266656100013
View details for PubMedID 19346494
View details for PubMedCentralID PMC2700314
-
A BIAS CORRECTION FOR THE MINIMUM ERROR RATE IN CROSS-VALIDATION
ANNALS OF APPLIED STATISTICS
2009; 3 (2): 822-829
View details for DOI 10.1214/08-AOAS224
View details for Web of Science ID 000271979600014
-
Prognostic significance of vascular endothelial growth factor (VEGF), VEGF receptors (VEGFR), and vascularity in diffuse large B-cell lymphoma treated with immunochemotherapy (R-CHOP)
45th Annual Meeting of the American-Society-of-Clinical-Oncology (ASCO)
AMER SOC CLINICAL ONCOLOGY. 2009
View details for Web of Science ID 000276606605490
-
Correlation of RRM1 expression in muscle invasive locally advanced urothelial cancer with age
45th Annual Meeting of the American-Society-of-Clinical-Oncology (ASCO)
AMER SOC CLINICAL ONCOLOGY. 2009
View details for Web of Science ID 000276606604062
-
Differentiation stage-specific expression of microRNAs in B lymphocytes and diffuse large B-cell lymphomas
BLOOD
2009; 113 (16): 3754-3764
Abstract
miRNAs are small RNA molecules binding to partially complementary sites in the 3'-UTR of target transcripts and repressing their expression. miRNAs orchestrate multiple cellular functions and play critical roles in cell differentiation and cancer development. We analyzed miRNA profiles in B-cell subsets during peripheral B-cell differentiation as well as in diffuse large B-cell lymphoma (DLBCL) cells. Our results show temporal changes in the miRNA expression during B-cell differentiation with a highly unique miRNA profile in germinal center (GC) lymphocytes. We provide experimental evidence that these changes may be physiologically relevant by demonstrating that GC-enriched hsa-miR-125b down-regulates the expression of IRF4 and PRDM1/BLIMP1, and memory B cell-enriched hsa-miR-223 down-regulates the expression of LMO2. We further demonstrate that although an important component of the biology of a malignant cell is inherited from its nontransformed cellular progenitor (GC centroblasts), aberrant miRNA expression is acquired upon cell transformation. A 9-miRNA signature was identified that could precisely differentiate the 2 major subtypes of DLBCL. Finally, expression of some of the miRNAs in this signature is correlated with clinical outcome of uniformly treated DLBCL patients.
View details for DOI 10.1182/blood-2008-10-184077
View details for Web of Science ID 000265445900016
View details for PubMedID 19047678
-
Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
JOURNAL OF MACHINE LEARNING RESEARCH
2009; 10: 883-906
Abstract
We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudo-likelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate.
View details for Web of Science ID 000270824600003
View details for PubMedID 21857799
View details for PubMedCentralID PMC3157941
-
Covariance-regularized regression and classification for high-dimensional problems.
Journal of the Royal Statistical Society. Series B, Statistical methodology
2009; 71 (3): 615-636
Abstract
In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing its log likelihood, under a multivariate normal model, subject to a constraint on its elements; this estimate is then used to estimate coefficients for the regression of the response onto the features. We show that ridge regression, the lasso, and the elastic net are special cases of covariance-regularized regression, and we demonstrate that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations. The covariance-regularized regression framework is extended to generalized linear models and linear discriminant analysis, and is used to analyze gene expression data sets with multiple class and survival outcomes.
View details for DOI 10.1111/j.1467-9868.2009.00699.x
View details for PubMedID 20084176
View details for PubMedCentralID PMC2806603
-
Temporal Changes in Gene Expression Induced by Sulforaphane in Human Prostate Cancer Cells
PROSTATE
2009; 69 (2): 181-190
Abstract
Prostate cancer is thought to arise as a result of oxidative stresses and induction of antioxidant electrophile defense (phase 2) enzymes has been proposed as a prostate cancer prevention strategy. The isothiocyanate sulforaphane, derived from cruciferous vegetables like broccoli, potently induces surrogate markers of phase 2 enzyme activity in prostate cells in vitro and in vivo. To better understand the temporal effects of sulforaphane and broccoli sprouts on gene expression in prostate cells, we carried out comprehensive transcriptome analysis using cDNA microarrays. Transcripts significantly modulated by sulforaphane over time were identified using StepMiner analysis. Ingenuity Pathway Analysis (IPA) was used to identify biological pathways, networks, and functions significantly altered by sulforaphane treatment. StepMiner and IPA revealed significant changes in many transcripts associated with cell growth and cell cycle, as well as a significant number associated with cellular response to oxidative damage and stress. Comparison to an existing dataset suggested that sulforaphane blocked cell growth by inducing G2/M arrest. Cell growth assays and flow cytometry analysis confirmed that sulforaphane inhibited cell growth and induced cell cycle arrest. Our data suggest that in prostate cells sulforaphane primarily induces cellular defenses and inhibits cell growth by causing G2/M phase arrest. Furthermore, based on the striking similarities in the gene expression patterns induced across experiments in these cells, sulforaphane appears to be the primary bioactive compound present in broccoli sprouts, suggesting that broccoli sprouts can serve as a suitable source for sulforaphane in intervention trials.
View details for DOI 10.1002/pros.20869
View details for Web of Science ID 000262701200008
View details for PubMedID 18973173
View details for PubMedCentralID PMC2612096
-
Blood autoantibody and cytokine profiles predict response to anti-tumor necrosis factor therapy in rheumatoid arthritis
ARTHRITIS RESEARCH & THERAPY
2009; 11 (3)
Abstract
Anti-TNF therapies have revolutionized the treatment of rheumatoid arthritis (RA), a common systemic autoimmune disease involving destruction of the synovial joints. However, in the practice of rheumatology approximately one-third of patients demonstrate no clinical improvement in response to treatment with anti-TNF therapies, while another third demonstrate a partial response, and one-third an excellent and sustained response. Since no clinical or laboratory tests are available to predict response to anti-TNF therapies, great need exists for predictive biomarkers. Here we present a multi-step proteomics approach using arthritis antigen arrays, a multiplex cytokine assay, and conventional ELISA, with the objective to identify a biomarker signature in three ethnically diverse cohorts of RA patients treated with the anti-TNF therapy etanercept. We identified a 24-biomarker signature that enabled prediction of a positive clinical response to etanercept in all three cohorts (positive predictive values 58 to 72%; negative predictive values 63 to 78%). We identified a multi-parameter protein biomarker that enables pretreatment classification and prediction of etanercept responders, and tested this biomarker using three independent cohorts of RA patients. Although further validation in prospective and larger cohorts is needed, our observations demonstrate that multiplex characterization of autoantibodies and cytokines provides clinical utility for predicting response to the anti-TNF therapy etanercept in RA patients.
View details for DOI 10.1186/ar2706
View details for PubMedID 19460157
-
Covariance-regularized regression and classification for high dimensional problems
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2009; 71: 615-636
Abstract
In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing its log likelihood, under a multivariate normal model, subject to a constraint on its elements; this estimate is then used to estimate coefficients for the regression of the response onto the features. We show that ridge regression, the lasso, and the elastic net are special cases of covariance-regularized regression, and we demonstrate that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations. The covariance-regularized regression framework is extended to generalized linear models and linear discriminant analysis, and is used to analyze gene expression data sets with multiple class and survival outcomes.
View details for DOI 10.1111/j.1467-9868.2009.00699.x
View details for Web of Science ID 000266602200003
View details for PubMedCentralID PMC2806603
-
Univariate Shrinkage in the Cox Model for High Dimensional Data
STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY
2009; 8 (1)
Abstract
We propose a method for prediction in Cox's proportional hazards model, when the number of features (regressors), p, exceeds the number of observations, n. The method assumes that the features are independent in each risk set, so that the partial likelihood factors into a product. As such, it is analogous to univariate thresholding in linear regression and nearest shrunken centroids in classification. We call the procedure Cox univariate shrinkage and demonstrate its usefulness on real and simulated data. The method has the attractive property of being essentially univariate in its operation: the features are entered into the model based on the size of their Cox score statistics. We illustrate the new method on real and simulated data, and compare it to other proposed methods for survival prediction with a large number of predictors.
View details for Web of Science ID 000265689500003
View details for PubMedID 19409065
-
Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data
STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY
2009; 8 (1)
Abstract
In recent work, several authors have introduced methods for sparse canonical correlation analysis (sparse CCA). Suppose that two sets of measurements are available on the same set of observations. Sparse CCA is a method for identifying sparse linear combinations of the two sets of variables that are highly correlated with each other. It has been shown to be useful in the analysis of high-dimensional genomic data, when two sets of assays are available on the same set of samples. In this paper, we propose two extensions to the sparse CCA methodology. (1) Sparse CCA is an unsupervised method; that is, it does not make use of outcome measurements that may be available for each observation (e.g., survival time or cancer subtype). We propose an extension to sparse CCA, which we call sparse supervised CCA, which results in the identification of linear combinations of the two sets of variables that are correlated with each other and associated with the outcome. (2) It is becoming increasingly common for researchers to collect data on more than two assays on the same set of samples; for instance, SNP, gene expression, and DNA copy number measurements may all be available. We develop sparse multiple CCA in order to extend the sparse CCA methodology to the case of more than two data sets. We demonstrate these new methods on simulated data and on a recently published and publicly available diffuse large B-cell lymphoma data set.
View details for DOI 10.2202/1544-6115.1470
View details for Web of Science ID 000267601500008
View details for PubMedID 19572827
View details for PubMedCentralID PMC2861323
-
CD81 Protein Is Expressed in Normal Germinal Center B-Cells and in Subtypes of Human Non-Hodgkin Lymphomas
98th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology
NATURE PUBLISHING GROUP. 2009: 275A–275A
View details for Web of Science ID 000262371501249
-
Discovery of Molecular Subtypes in Leiomyosarcoma through Integrative Molecular Profiling
98th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology
NATURE PUBLISHING GROUP. 2009: 368A–368A
View details for Web of Science ID 000262371501667
-
Discovery of Molecular Subtypes in Leiomyosarcoma through Integrative Molecular Profiling
98th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology
NATURE PUBLISHING GROUP. 2009: 368A–368A
View details for Web of Science ID 000262486301668
-
Lymphoma-Expressed VEGF-A, VEGFR-1, VEGFR-2, and Microvessel Density Are Not Predictive of Overall Survival in Follicular Lymphoma.
50th Annual Meeting of the American-Society-of-Hematology/ASH/ASCO Joint Symposium
AMER SOC HEMATOLOGY. 2008: 1290–90
View details for Web of Science ID 000262104704385
-
Differentiation-Stage-Specific Expression of MicroRNAs in B-Lymphocytes and Diffuse Large B-Cell Lymphomas (DLBCL)
50th Annual Meeting of the American-Society-of-Hematology/ASH/ASCO Joint Symposium
AMER SOC HEMATOLOGY. 2008: 299–99
View details for Web of Science ID 000262104701029
-
LMO2 Protein Expression Predicts Survival in Patients with Diffuse Large B-Cell Lymphoma Treated with Immunochemotherapy (RCHOP): A Multicenter Validation Study.
50th Annual Meeting of the American-Society-of-Hematology/ASH/ASCO Joint Symposium
AMER SOC HEMATOLOGY. 2008: 1291–91
View details for Web of Science ID 000262104704387
-
Neither CD68+ Nor CD163+ Macrophages Are Associated with Decreased Survival in Follicular Lymphoma
50th Annual Meeting of the American-Society-of-Hematology/ASH/ASCO Joint Symposium
AMER SOC HEMATOLOGY. 2008: 1284–84
View details for Web of Science ID 000262104704365
-
TESTING SIGNIFICANCE OF FEATURES BY LASSOED PRINCIPAL COMPONENTS
ANNALS OF APPLIED STATISTICS
2008; 2 (3): 986-1012
Abstract
We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample t-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an L(1) penalty in order to de-noise the resulting projections. We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.
View details for DOI 10.1214/08-AOAS182
View details for Web of Science ID 000261057900009
View details for PubMedCentralID PMC2743444
-
TESTING SIGNIFICANCE OF FEATURES BY LASSOED PRINCIPAL COMPONENTS.
The annals of applied statistics
2008; 2 (3): 986-1012
Abstract
We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample t-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an L(1) penalty in order to de-noise the resulting projections. We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.
View details for DOI 10.1214/08-AOAS182SUPP
View details for PubMedID 19756232
View details for PubMedCentralID PMC2743444
-
"Preconditioning" for feature selection and regression in high-dimensional problems
ANNALS OF STATISTICS
2008; 36 (4): 1595-1618
View details for DOI 10.1214/009053607000000578
View details for Web of Science ID 000258243000007
-
Sparse inverse covariance estimation with the graphical lasso
BIOSTATISTICS
2008; 9 (3): 432-441
Abstract
We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm--the graphical lasso--that is remarkably fast: It solves a 1000-node problem (approximately 500,000 parameters) in at most a minute and is 30-4000 times faster than competing methods. It also provides a conceptual link between the exact problem and the approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on some cell-signaling data from proteomics.
View details for DOI 10.1093/biostatistics/kxm045
View details for Web of Science ID 000256977000005
View details for PubMedID 18079126
View details for PubMedCentralID PMC3019769
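For reference, the lasso-penalized Gaussian log-likelihood described in the graphical lasso abstract above is commonly written as follows. The notation is an assumption made here for illustration, since the abstract fixes none: S is the empirical covariance matrix of the p features, Θ is the inverse covariance matrix being estimated, and ρ ≥ 0 is the penalty weight.

```latex
\hat{\Theta} \;=\; \arg\max_{\Theta \succ 0} \;
  \log\det\Theta \;-\; \operatorname{tr}(S\,\Theta) \;-\; \rho\,\lVert \Theta \rVert_1
```

Larger values of ρ set more entries of Θ exactly to zero, and each zero corresponds to a missing edge in the estimated graph. The parameter count quoted in the abstract follows from symmetry: a 1000-node problem has p(p+1)/2 = 500,500 free entries.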
-
Complementary hierarchical clustering
BIOSTATISTICS
2008; 9 (3): 467-483
Abstract
When applying hierarchical clustering algorithms to cluster patient samples from microarray data, the clustering patterns generated by most algorithms tend to be dominated by groups of highly differentially expressed genes that have closely related expression patterns. Sometimes, these genes may not be relevant to the biological process under study or their functions may already be known. The problem is that these genes can potentially drown out the effects of other genes that are relevant or have novel functions. We propose a procedure called complementary hierarchical clustering that is designed to uncover the structures arising from these novel genes that are not as highly expressed. Simulation studies show that the procedure is effective when applied to a variety of examples. We also define a concept called relative gene importance that can be used to identify the influential genes in a given clustering. Finally, we analyze a microarray data set from 295 breast cancer patients, using clustering with the correlation-based distance measure. The complementary clustering reveals a grouping of the patients which is uncorrelated with a number of known prognostic signatures and significantly differing distant metastasis-free probabilities.
View details for DOI 10.1093/biostatistics/kxm046
View details for Web of Science ID 000256977000008
View details for PubMedID 18093965
View details for PubMedCentralID PMC3294318
-
Paraffin-based 6-gene model predicts outcome in diffuse large B-cell lymphoma patients treated with R-CHOP
BLOOD
2008; 111 (12): 5509-5514
Abstract
Diffuse large B-cell lymphoma (DLBCL) is a heterogeneous disease characterized by variable clinical outcomes. Outcome prediction at the time of diagnosis is of paramount importance. Previously, we constructed a 6-gene model for outcome prediction of DLBCL patients treated with anthracycline-based chemotherapies. However, the standard therapy has evolved into rituximab, cyclophosphamide, doxorubicin, vincristine and prednisone (R-CHOP). Herein, we evaluated the predictive power of a paraffin-based 6-gene model in R-CHOP-treated DLBCL patients. RNA was successfully extracted from 132 formalin-fixed paraffin-embedded (FFPE) specimens. Expression of the 6 genes comprising the model was measured and the mortality predictor score was calculated for each patient. The mortality predictor score divided patients into low-risk (below median) and high-risk (above median) subgroups with significantly different overall survival (OS; P = .002) and progression-free survival (PFS; P = .038). The model also predicted OS and PFS when the mortality predictor score was considered as a continuous variable (P = .002 and .010, respectively) and was independent of the IPI for prediction of OS (P = .008). These findings demonstrate that the prognostic value of the 6-gene model remains significant in the era of R-CHOP treatment and that the model can be applied to routine FFPE tissue from initial diagnostic biopsies.
View details for DOI 10.1182/blood-2008-02-136374
View details for Web of Science ID 000256786500021
View details for PubMedID 18445689
View details for PubMedCentralID PMC2424149
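The risk stratification in the abstract above is a median split of the mortality predictor score. A minimal Python sketch of that step is shown below. The scores are invented for illustration, and the tie-handling rule is an assumption, since the abstract does not say how scores equal to the median were treated.

```python
from statistics import median


def risk_groups(scores):
    """Label each score 'low' (at or below the median) or 'high' (above it)."""
    cut = median(scores)
    # Assumption: a score exactly equal to the median goes to the low-risk group.
    return ["high" if s > cut else "low" for s in scores]


# Hypothetical mortality predictor scores for six patients.
scores = [-0.8, 0.3, 1.1, -0.2, 0.6, -1.4]
groups = risk_groups(scores)
print(list(zip(scores, groups)))
```

Treating the score as a continuous covariate, as the abstract also reports, avoids the information loss that comes with any single cut point.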
-
A STUDY OF PRE-VALIDATION
ANNALS OF APPLIED STATISTICS
2008; 2 (2): 643-664
View details for DOI 10.1214/07-AOAS152
View details for Web of Science ID 000261057800015
-
An FLT3 gene-expression signature predicts clinical outcome in normal karyotype AML
BLOOD
2008; 111 (9): 4490-4495
Abstract
Acute myeloid leukemia with normal karyotype (NK-AML) represents a cytogenetic grouping with intermediate prognosis but substantial molecular and clinical heterogeneity. Within this subgroup, presence of FLT3 (FMS-like tyrosine kinase 3) internal tandem duplication (ITD) mutation predicts less favorable outcome. The goal of our study was to discover gene-expression patterns correlated with FLT3-ITD mutation and to evaluate the utility of a FLT3 signature for prognostication. DNA microarrays were used to profile gene expression in a training set of 65 NK-AML cases, and supervised analysis, using the Prediction Analysis of Microarrays method, was applied to build a gene expression-based predictor of FLT3-ITD mutation status. The optimal predictor, composed of 20 genes, was then evaluated by classifying expression profiles from an independent test set of 72 NK-AML cases. The predictor exhibited modest performance (73% sensitivity; 85% specificity) in classifying FLT3-ITD status. Remarkably, however, the signature outperformed FLT3-ITD mutation status in predicting clinical outcome. The signature may better define clinically relevant FLT3 signaling and/or alternative changes that phenocopy FLT3-ITD, whereas the signature genes provide a starting point to dissect these pathways. Our findings support the potential clinical utility of a gene expression-based measure of FLT3 pathway activation in AML.
View details for DOI 10.1182/blood-2007-09-115055
View details for Web of Science ID 000255387400016
View details for PubMedID 18309032
-
IRF9 and STAT1 are required for IgG autoantibody production and B cell expression of TLR7 in mice
JOURNAL OF CLINICAL INVESTIGATION
2008; 118 (4): 1417-1426
Abstract
A hallmark of SLE is the production of high-titer, high-affinity, isotype-switched IgG autoantibodies directed against nucleic acid-associated antigens. Several studies have established a role for both type I IFN (IFN-I) and the activation of TLRs by nucleic acid-associated autoantigens in the pathogenesis of this disease. Here, we demonstrate that 2 IFN-I signaling molecules, IFN regulatory factor 9 (IRF9) and STAT1, were required for the production of IgG autoantibodies in the pristane-induced mouse model of SLE. In addition, levels of IgM autoantibodies were increased in pristane-treated Irf9 -/- mice, suggesting that IRF9 plays a role in isotype switching in response to self antigens. Upregulation of TLR7 by IFN-alpha was greatly reduced in Irf9 -/- and Stat1 -/- B cells. Irf9 -/- B cells were incapable of being activated through TLR7, and Stat1 -/- B cells were impaired in activation through both TLR7 and TLR9. These data may reveal a novel role for IFN-I signaling molecules in both TLR-specific B cell responses and production of IgG autoantibodies directed against nucleic acid-associated autoantigens. Our results suggest that IFN-I is upstream of TLR signaling in the activation of autoreactive B cells in SLE.
View details for DOI 10.1172/JCI30065
View details for Web of Science ID 000254588600035
View details for PubMedID 18340381
View details for PubMedCentralID PMC2267033
-
Multiplexed proximity ligation assays to profile putative plasma biomarkers relevant to pancreatic and ovarian cancer
CLINICAL CHEMISTRY
2008; 54 (3): 582-589
Abstract
Sensitive methods are needed for biomarker discovery and validation. We tested one promising technology, multiplex proximity ligation assay (PLA), in a pilot study profiling plasma biomarkers in pancreatic and ovarian cancer. We used 4 panels of 6- and 7-plex PLAs to detect biomarkers, with each assay consuming 1 microL plasma and using either matched monoclonal antibody pairs or single batches of polyclonal antibody. Protein analytes were converted to unique DNA amplicons by proximity ligation and subsequently detected by quantitative PCR. We profiled 18 pancreatic cancer cases and 19 controls and 19 ovarian cancer cases and 20 controls for the following proteins: a disintegrin and metalloprotease 8, CA-125, CA 19-9, carboxypeptidase A1, carcinoembryonic antigen, connective tissue growth factor, epidermal growth factor receptor, epithelial cell adhesion molecule, Her2, galectin-1, insulin-like growth factor 2, interleukin-1alpha, interleukin-7, mesothelin, macrophage migration inhibitory factor, osteopontin, secretory leukocyte peptidase inhibitor, tumor necrosis factor alpha, vascular endothelial growth factor, and chitinase 3-like 1. Probes for CA-125 were present in 3 of the multiplex panels. We measured plasma concentrations of the CA-125-mesothelin complex by use of a triple-specific PLA with 2 ligation events among 3 probes. The assays displayed consistent measurements of CA-125 independent of which other markers were simultaneously detected and showed good correlation with Luminex data. In comparison to literature reports, we achieved expected results for other putative markers. Multiplex PLA using either matched monoclonal antibodies or single batches of polyclonal antibody should prove useful for identifying and validating sets of putative disease biomarkers and finding multimarker panels.
View details for DOI 10.1373/clinchem.2007.093195
View details for Web of Science ID 000253570400019
View details for PubMedID 18171715
-
hCAP-D3 expression marks a prostate cancer subtype with favorable clinical behavior and androgen signaling signature
AMERICAN JOURNAL OF SURGICAL PATHOLOGY
2008; 32 (2): 205-209
Abstract
Growing evidence suggests that only a fraction of prostate cancers detected clinically are potentially lethal. An important clinical issue is identifying men with indolent cancer who might be spared aggressive therapies with associated morbidities. Previously, using microarray analysis we defined 3 molecular subtypes of prostate cancer with different gene-expression patterns. One, subtype-1, displayed features consistent with more indolent behavior, where an immunohistochemical marker (AZGP1) for subtype-1 predicted favorable outcome after radical prostatectomy. Here we characterize a second candidate tissue biomarker, hCAP-D3, expressed in subtype-1 prostate tumors. hCAP-D3 expression, assayed by RNA in situ hybridization on a tissue microarray comprising 225 cases, was associated with decreased tumor recurrence after radical prostatectomy (P=0.004), independent of pathologic tumor stage, Gleason grade, and preoperative prostate-specific antigen levels. Simultaneous assessment of hCAP-D3 and AZGP1 expression in this tumor set improved outcome prediction. We have previously demonstrated that hCAP-D3 is induced by androgen in prostate cells. Extending this finding, Gene Set Enrichment Analysis revealed enrichment of androgen-responsive genes in subtype-1 tumors (P=0.019). Our findings identify hCAP-D3 as a new biomarker for subtype-1 tumors that improves prognostication, and reveal androgen signaling as an important biologic feature of this potentially clinically favorable molecular subtype.
View details for PubMedID 18223322
-
LMO2 protein expression predicts survival in patients with diffuse large B-Cell lymphoma treated with anthracycline-based chemotherapy with and without rituximab
JOURNAL OF CLINICAL ONCOLOGY
2008; 26 (3): 447-454
Abstract
The heterogeneity of diffuse large B-cell lymphoma (DLBCL) has prompted the search for new markers that can accurately separate prognostic risk groups. We previously showed in a multivariate model that LMO2 mRNA was a strong predictor of superior outcome in DLBCL patients. Here, we tested the prognostic impact of LMO2 protein expression in DLBCL patients treated with anthracycline-based chemotherapy with or without rituximab. DLBCL patients treated with anthracycline-based chemotherapy alone (263 patients) or with the addition of rituximab (80 patients) were studied using immunohistochemistry for LMO2 on tissue microarrays of original biopsies. Staining results were correlated with outcome. In anthracycline-treated patients, LMO2 protein expression was significantly correlated with improved overall survival (OS) and progression-free survival (PFS) in univariate analyses (OS, P = .018; PFS, P = .010) and was a significant predictor independent of the clinical International Prognostic Index (IPI) in multivariate analysis. Similarly, in patients treated with the combination of anthracycline-containing regimens and rituximab, LMO2 protein expression was also significantly correlated with improved OS and PFS (OS, P = .005; PFS, P = .009) and was a significant predictor independent of the IPI in multivariate analysis. We conclude that LMO2 protein expression is a prognostic marker in DLBCL patients treated with anthracycline-based regimens alone or in combination with rituximab. After further validation, immunohistologic analysis of LMO2 protein expression may become a practical assay for newly diagnosed DLBCL patients to optimize their clinical management.
View details for DOI 10.1200/JCO.2007.13.0690
View details for Web of Science ID 000254177200020
View details for PubMedID 18086797
-
Boolean implication networks derived from large scale, whole genome microarray datasets
GENOME BIOLOGY
2008; 9 (10)
Abstract
We describe a method for extracting Boolean implications (if-then relationships) in very large amounts of gene expression microarray data. A meta-analysis of data from thousands of microarrays for humans, mice, and fruit flies finds millions of implication relationships between genes that would be missed by other methods. These relationships capture gender differences, tissue differences, development, and differentiation. New relationships are discovered that are preserved across all three species.
View details for PubMedID 18973690
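The if-then relationships described in the Boolean implication abstract above can be illustrated with a deliberately simplified Python sketch: discretize two genes into high and low, then check whether the quadrant that would contradict "A high implies B high" is nearly empty. The function name, the fixed thresholds, and the `max_violation_rate` tolerance are all assumptions made for this example. This is not the statistical test used in the paper.

```python
def implies_high_high(a, b, a_thr, b_thr, max_violation_rate=0.05):
    """Check the implication 'A high => B high' across paired samples.

    The implication is accepted when the share of A-high samples in which
    B is low stays at or below max_violation_rate.
    """
    a_high = [x > a_thr for x in a]
    b_high = [y > b_thr for y in b]
    n_a_high = sum(a_high)
    if n_a_high == 0:
        return False  # no A-high samples, so the data cannot support the rule
    violations = sum(1 for ah, bh in zip(a_high, b_high) if ah and not bh)
    return violations / n_a_high <= max_violation_rate


# Toy expression values for two genes measured on the same six arrays.
gene_a = [1.0, 2.0, 8.0, 9.0, 1.5, 7.5]
gene_b = [2.0, 9.0, 8.5, 9.5, 1.0, 7.0]

forward = implies_high_high(gene_a, gene_b, a_thr=5.0, b_thr=5.0)
reverse = implies_high_high(gene_b, gene_a, a_thr=5.0, b_thr=5.0)
print(forward, reverse)
```

In the toy data the rule holds in one direction only, which is the kind of asymmetric relationship a symmetric measure such as correlation does not express directly.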
-
LMO2 protein expression predicts survival in patients with diffuse large B-cell lymphoma treated with anthracycline-based chemotherapy with or without rituximab
97th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology
NATURE PUBLISHING GROUP. 2008: 267A–267A
View details for Web of Science ID 000252180201350
-
Prognostic significance of VEGF, VEGF receptors, and microvessel density in diffuse large B cell lymphoma treated with anthracycline-based chemotherapy
LABORATORY INVESTIGATION
2008; 88 (1): 38-47
Abstract
Vascular endothelial growth factor-mediated signaling has at least two potential roles in diffuse large B cell lymphoma: potentiation of angiogenesis, and potentiation of lymphoma cell proliferation and/or survival induced by autocrine vascular endothelial growth factor receptor-mediated signaling. We have recently shown that diffuse large B cell lymphomas expressing high levels of vascular endothelial growth factor protein also express high levels of vascular endothelial growth factor receptor-1 and vascular endothelial growth factor receptor-2. We have now assessed a larger multi-institutional cohort of patients with de novo diffuse large B cell lymphoma treated with anthracycline-based therapy to address whether tumor vascularity, or expression of vascular endothelial growth factor protein and its receptors, contribute to patient outcomes. Our results show that increased tumor vascularity is associated with poor overall survival (P=0.047), and is independent of the international prognostic index. High expression of vascular endothelial growth factor receptor-1 by lymphoma cells by contrast is associated with improved overall survival (P=0.044). The combination of high vascular endothelial growth factor and vascular endothelial growth factor receptor-1 protein expression by lymphoma cells identifies a subgroup of patients with improved overall (P=0.003) and progression-free (P=0.026) survival; these findings are also independent of the international prognostic index. The prognostic significance of overexpression of this ligand-receptor pair suggests that autocrine signaling via vascular endothelial growth factor receptor-1 may represent a survival or proliferation pathway in diffuse large B cell lymphoma. Dependence on autocrine vascular endothelial growth factor receptor-1-mediated signaling may render a subset of diffuse large B-cell lymphomas susceptible to anthracycline-based therapy.
View details for DOI 10.1038/labinvest.3700697
View details for Web of Science ID 000251820600004
View details for PubMedID 17998899
-
Spatial smoothing and hot spot detection for CGH data using the fused lasso
BIOSTATISTICS
2008; 9 (1): 18-29
Abstract
We apply the "fused lasso" regression method of (TSRZ2004) to the problem of "hot-spot detection", in particular, detection of regions of gain or loss in comparative genomic hybridization (CGH) data. The fused lasso criterion leads to a convex optimization problem, and we provide a fast algorithm for its solution. Estimates of false-discovery rate are also provided. Our studies show that the new method generally outperforms competing methods for calling gains and losses in CGH data.
View details for DOI 10.1093/biostatistics/kxm013
View details for Web of Science ID 000251679400002
View details for PubMedID 17513312
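The fused lasso criterion mentioned in the abstract above combines a squared-error fit with two L1 penalties, one on the coefficients themselves and one on differences between neighbouring coefficients. The Python sketch below evaluates that criterion in its usual signal-approximation form for a candidate fit. It only computes the objective and does not implement the paper's algorithm; the function name, data, and penalty weights are assumptions for illustration.

```python
def fused_lasso_objective(y, beta, lam1, lam2):
    """Squared-error loss plus sparsity and fusion (smoothness) penalties."""
    fit = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    sparsity = lam1 * sum(abs(bi) for bi in beta)
    fusion = lam2 * sum(abs(beta[i] - beta[i - 1]) for i in range(1, len(beta)))
    return fit + sparsity + fusion


# Toy log-ratio profile with one region of apparent gain (positions 2 to 4).
y = [0.1, -0.2, 2.1, 1.9, 2.0, 0.0]
# Candidate piecewise-constant fit: zero outside the region, 2.0 inside it.
flat_fit = [0.0, 0.0, 2.0, 2.0, 2.0, 0.0]

obj = fused_lasso_objective(y, flat_fit, lam1=0.1, lam2=0.5)
print(round(obj, 3))
```

The fusion term charges only for the two jumps at the region boundaries, so piecewise-constant fits of this shape are cheap under the penalty, which suits profiles made up of contiguous gains and losses.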
-
Polymorphisms in hypoxia inducible factor 1 and the initial clinical presentation of coronary disease
AMERICAN HEART JOURNAL
2007; 154 (6): 1035-1042
Abstract
Only some patients with coronary artery disease (CAD) develop acute myocardial infarction (MI), and emerging evidence suggests vulnerability to MI varies systematically among patients and may have a genetic component. The goal of this study was to assess whether polymorphisms in genes encoding elements of pathways mediating the response to ischemia affect vulnerability to MI among patients with underlying CAD. We prospectively identified patients at the time of their initial clinical presentation of CAD who had either an acute MI or stable exertional angina. We collected clinical data and genotyped 34 polymorphisms in 6 genes (ANGPT1, HIF1A, THBS1, VEGFA, VEGFC, VEGFR2). The 909 patients with acute MI were significantly more likely than the 466 patients with stable angina to be male, current smokers, and hypertensive, and less likely to be taking beta-blockers or statins. Three polymorphisms in HIF1A (Pro582Ser, rs11549465; rs1087314; and Thr418Ile, rs41508050) were significantly more common in patients who presented with stable exertional angina rather than acute MI, even after statistical adjustment for cardiac risk factors and medications. The HIF-mediated transcriptional activity was significantly lower when HIF1A null fibroblasts were transfected with variant HIF1A alleles than with wild-type HIF1A alleles. Polymorphisms in HIF1A were associated with development of stable exertional angina rather than acute MI as the initial clinical presentation of CAD.
View details for DOI 10.1016/j.ahj.2007.07.042
View details for Web of Science ID 000251396200006
View details for PubMedID 18035072
-
PATHWISE COORDINATE OPTIMIZATION
ANNALS OF APPLIED STATISTICS
2007; 1 (2): 302-332
View details for DOI 10.1214/07-AOAS131
View details for Web of Science ID 000261057600003
-
Anti-idiotype antibody response after vaccination correlates with better overall survival in follicular lymphoma
49th Annual Meeting of the American-Society-of-Hematology
AMER SOC HEMATOLOGY. 2007: 199A–199A
View details for Web of Science ID 000251100800648
-
Survival in follicular lymphoma: The Stanford experience, 1960-2003.
49th Annual Meeting of the American-Society-of-Hematology
AMER SOC HEMATOLOGY. 2007: 1005A–1005A
View details for Web of Science ID 000251100804465
-
LMO2 protein expression predicts survival in patients with diffuse large B-cell lymphoma in the pre- and post-rituximab treatment eras
49th Annual Meeting of the American-Society-of-Hematology
AMER SOC HEMATOLOGY. 2007: 24A–24A
View details for Web of Science ID 000251100800053
-
Major histocompatibility class II (MHCII) and germinal center associated gene expression correlate with overall survival in rituximab and CHOP-like treated diffuse large B-cell lymphoma (DLBCL) patients, using
49th Annual Meeting of the American-Society-of-Hematology
AMER SOC HEMATOLOGY. 2007: 23A–23A
View details for Web of Science ID 000251100800050
-
Classification and prediction of clinical Alzheimer's diagnosis based on plasma signaling proteins
NATURE MEDICINE
2007; 13 (11): 1359-1362
Abstract
A molecular test for Alzheimer's disease could lead to better treatment and therapies. We found 18 signaling proteins in blood plasma that can be used to classify blinded samples from Alzheimer's and control subjects with close to 90% accuracy and to identify patients who had mild cognitive impairment that progressed to Alzheimer's disease 2-6 years later. Biological analysis of the 18 proteins points to systemic dysregulation of hematopoiesis, immune responses, apoptosis and neuronal support in presymptomatic Alzheimer's disease.
View details for DOI 10.1038/nm1653
View details for Web of Science ID 000250736900029
View details for PubMedID 17934472
-
On the "degrees of freedom" of the lasso
ANNALS OF STATISTICS
2007; 35 (5): 2173-2192
View details for DOI 10.1214/009053607000000127
View details for Web of Science ID 000251096100013
-
Expression and prognostic significance of a panel of tissue hypoxia markers in head-and-neck squamous cell carcinomas
48th Annual Meeting of the American-Society-for-Therapeutic-Radiology-and-Oncology (ASTRO)
ELSEVIER SCIENCE INC. 2007: 167–75
Abstract
To investigate the expression pattern of hypoxia-induced proteins identified as being involved in malignant progression of head-and-neck squamous cell carcinoma (HNSCC) and to determine their relationship to tumor pO(2) and prognosis. We performed immunohistochemical staining of hypoxia-induced proteins (carbonic anhydrase IX [CA IX], BNIP3L, connective tissue growth factor, osteopontin, ephrin A1, hypoxia inducible gene-2, dihydrofolate reductase, galectin-1, IkappaB kinase beta, and lysyl oxidase) on tumor tissue arrays of 101 HNSCC patients with pretreatment pO(2) measurements. Analysis of variance and Fisher's exact tests were used to evaluate the relationship between marker expression, tumor pO(2), and CA IX staining. Cox proportional hazard model and log-rank tests were used to determine the relationship between markers and prognosis. Osteopontin expression correlated with tumor pO(2) (Eppendorf measurements) (p = 0.04). However, there was a strong correlation between lysyl oxidase, ephrin A1, and galectin-1 and CA IX staining. These markers also predicted for cancer-specific survival and overall survival on univariate analysis. A hypoxia score of 0-5 was assigned to each patient, on the basis of the presence of strong staining for these markers, whereby a higher score signifies increased marker expression. On multivariate analysis, increasing hypoxia score was an independent prognostic factor for cancer-specific survival (p = 0.015) and was borderline significant for overall survival (p = 0.057) when adjusted for other independent predictors of outcomes (hemoglobin and age). We identified a panel of hypoxia-related tissue markers that correlates with treatment outcomes in HNSCC. Validation of these markers will be needed to determine their utility in identifying patients for hypoxia-targeted therapy.
View details for DOI 10.1016/j.ijrobp.2007.01.071
View details for PubMedID 17707270
-
Notch signals positively regulate activity of the mTOR pathway in T-cell acute lymphoblastic leukemia
BLOOD
2007; 110 (1): 278-286
Abstract
Constitutive Notch activation is required for the proliferation of a subgroup of T-cell acute lymphoblastic leukemia (T-ALL). Downstream pathways that transmit pro-oncogenic signals are not well characterized. To identify these pathways, protein microarrays were used to profile the phosphorylation state of 108 epitopes on 82 distinct signaling proteins in a panel of 13 T-cell leukemia cell lines treated with a gamma-secretase inhibitor (GSI) to inhibit Notch signals. The microarray screen detected GSI-induced hypophosphorylation of multiple signaling proteins in the mTOR pathway. This effect was rescued by expression of the intracellular domain of Notch and mimicked by dominant negative MAML1, confirming Notch specificity. Withdrawal of Notch signals prevented stimulation of the mTOR pathway by mitogenic factors. These findings collectively suggest that the mTOR pathway is positively regulated by Notch in T-ALL cells. The effect of GSI on the mTOR pathway was independent of changes in phosphatidylinositol-3 kinase and Akt activity, but was rescued by expression of c-Myc, a direct transcriptional target of Notch, implicating c-Myc as an intermediary between Notch and mTOR. T-ALL cell growth was suppressed in a highly synergistic manner by simultaneous treatment with the mTOR inhibitor rapamycin and GSI, which represents a rational drug combination for treating this aggressive human malignancy.
View details for DOI 10.1182/blood-2006-08-039883
View details for Web of Science ID 000247611000041
View details for PubMedID 17363738
View details for PubMedCentralID PMC1896117
-
Extracting binary signals from microarray time-course data
NUCLEIC ACIDS RESEARCH
2007; 35 (11): 3705-3712
Abstract
This article presents a new method for analyzing microarray time courses by identifying genes that undergo abrupt transitions in expression level, and the time at which the transitions occur. The algorithm matches the sequence of expression levels for each gene against temporal patterns having one or two transitions between two expression levels. The algorithm reports a P-value for the matching pattern of each gene, and a global false discovery rate can also be computed. After matching, genes can be sorted by the direction and time of transitions. Genes can be partitioned into sets based on the direction and time of change for further analysis, such as comparison with Gene Ontology annotations or binding site motifs. The method is evaluated on simulated and actual time-course data. On microarray data for budding yeast, it is shown that the groups of genes that change in similar ways and at similar times have significant and relevant Gene Ontology annotations.
View details for DOI 10.1093/nar/gkm284
View details for PubMedID 17517782
-
ON TESTING THE SIGNIFICANCE OF SETS OF GENES
ANNALS OF APPLIED STATISTICS
2007; 1 (1): 107-129
View details for DOI 10.1214/07-AOAS101
View details for Web of Science ID 000261050400006
-
Oncogenic regulators and substrates of the anaphase promoting complex/cyclosome are frequently overexpressed in malignant tumors
AMERICAN JOURNAL OF PATHOLOGY
2007; 170 (5): 1793-1805
Abstract
The fidelity of cell division is dependent on the accumulation and ordered destruction of critical protein regulators. By triggering the appropriately timed, ubiquitin-dependent proteolysis of the mitotic regulatory proteins securin, cyclin B, aurora A kinase, and polo-like kinase 1, the anaphase promoting complex/cyclosome (APC/C) ubiquitin ligase plays an essential role in maintaining genomic stability. Misexpression of these APC/C substrates, individually, has been implicated in genomic instability and cancer. However, no comprehensive survey of the extent of their misregulation in tumors has been performed. Here, we analyzed more than 1600 benign and malignant tumors by immunohistochemical staining of tissue microarrays and found frequent overexpression of securin, polo-like kinase 1, aurora A, and Skp2 in malignant tumors. Positive and negative APC/C regulators, Cdh1 and Emi1, respectively, were also more strongly expressed in malignant versus benign tumors. Clustering and statistical analysis supports the finding that malignant tumors generally show broad misregulation of mitotic APC/C substrates not seen in benign tumors, suggesting that a "mitotic profile" in tumors may result from misregulation of the APC/C destruction pathway. This profile of misregulated mitotic APC/C substrates and regulators in malignant tumors suggests that analysis of this pathway may be diagnostically useful and represent a potentially important therapeutic target.
View details for DOI 10.2353/ajpath.2007.060767
View details for PubMedID 17456782
-
Disease-specific genomic analysis: identifying the signature of pathologic biology
BIOINFORMATICS
2007; 23 (8): 957-965
Abstract
Genomic high-throughput technology generates massive data, providing opportunities to understand countless facets of the functioning genome. It also raises profound issues in identifying data relevant to the biology being studied. We introduce a method for the analysis of pathologic biology that unravels the disease characteristics of high dimensional data. The method, disease-specific genomic analysis (DSGA), is intended to precede standard techniques like clustering or class prediction, and enhance their performance and ability to detect disease. DSGA measures the extent to which the disease deviates from a continuous range of normal phenotypes, and isolates the aberrant component of data. In several microarray cancer datasets, we show that DSGA outperforms standard methods. We then use DSGA to highlight a novel subdivision of an important class of genes in breast cancer, the estrogen receptor (ER) cluster. We also identify new markers distinguishing ductal and lobular breast cancers. Although our examples focus on microarrays, DSGA generalizes to any high dimensional genomic/proteomic data.
View details for DOI 10.1093/bioinformatics/btm033
View details for Web of Science ID 000246293000006
View details for PubMedID 17277331
-
Averaged gene expressions for regression
BIOSTATISTICS
2007; 8 (2): 212-227
Abstract
Although averaging is a simple technique, it plays an important role in reducing variance. We use this essential property of averaging in regression of the DNA microarray data, which poses the challenge of having far more features than samples. In this paper, we introduce a two-step procedure that combines (1) hierarchical clustering and (2) Lasso. By averaging the genes within the clusters obtained from hierarchical clustering, we define supergenes and use them to fit regression models, thereby attaining concise interpretation and accuracy. Our methods are supported with theoretical justifications and demonstrated on simulated and real data sets.
View details for DOI 10.1093/biostatistics/kxl002
View details for Web of Science ID 000245512000004
View details for PubMedID 16698769
-
Microvessel density and expression of vascular endothelial growth factor and its receptors in diffuse large B-cell lymphoma subtypes
AMERICAN JOURNAL OF PATHOLOGY
2007; 170 (4): 1362-1369
Abstract
Angiogenesis is known to play a major role in neoplasia, including hematolymphoid neoplasia. We assessed the relationships among angiogenesis and expression of vascular endothelial growth factor and its receptors in the context of clinically and biologically relevant subtypes of diffuse large B-cell lymphoma using immunohistochemical evaluation of tissue microarrays. We found that diffuse large B-cell lymphoma specimens showing higher local vascular endothelial growth factor expression showed correspondingly higher microvessel density, implying that lymphoma cells induce local tumor angiogenesis. In addition, local vascular endothelial growth factor expression was higher in those specimens showing higher expression of the receptors of the growth factor, suggesting an autocrine growth-promoting feedback loop. The germinal center-like and nongerminal center-like subtypes of diffuse large B-cell lymphoma were biologically and prognostically distinct. Interestingly, only in the more clinically aggressive nongerminal center-like subtype were microvessel densities significantly higher in specimens showing higher vascular endothelial growth factor expression; the same was true for the finding of higher vascular endothelial growth factor receptor-1 expression in conjunction with higher vascular endothelial growth factor expression. These differences may have important implications for the responsiveness of the two diffuse large B-cell lymphoma subtypes to anti-vascular endothelial growth factor and anti-angiogenic therapies.
View details for DOI 10.2353/ajpath.2007.060901
View details for Web of Science ID 000245233000022
View details for PubMedID 17392174
View details for PubMedCentralID PMC1829468
-
Margin trees for high-dimensional classification
JOURNAL OF MACHINE LEARNING RESEARCH
2007; 8: 637-652
View details for Web of Science ID 000247002700009
-
Outlier sums for differential gene expression analysis
BIOSTATISTICS
2007; 8 (1): 2-8
Abstract
We propose a method for detecting genes that, in a disease group, exhibit unusually high gene expression in some but not all samples. This can be particularly useful in cancer studies, where mutations that can amplify or turn off gene expression often occur in only a minority of samples. In real and simulated examples, the new method often exhibits lower false discovery rates than simple t-statistic thresholding. We also compare our approach to the recent cancer profile outlier analysis proposal of Tomlins and others (2005).
View details for DOI 10.1093/biostatistics/kxl005
View details for Web of Science ID 000242715400001
View details for PubMedID 16702229
-
Forward stagewise regression and the monotone lasso
ELECTRONIC JOURNAL OF STATISTICS
2007; 1: 1-29
View details for DOI 10.1214/07-EJS004
View details for Web of Science ID 000207854200001
-
Regularized linear discriminant analysis and its application in microarrays
BIOSTATISTICS
2007; 8 (1): 86-100
Abstract
In this paper, we introduce a modified version of linear discriminant analysis, called the "shrunken centroids regularized discriminant analysis" (SCRDA). This method generalizes the idea of the "nearest shrunken centroids" (NSC) (Tibshirani and others, 2003) into the classical discriminant analysis. The SCRDA method is specially designed for classification problems in high dimension low sample size situations, for example, microarray data. Through both simulated data and real life data, it is shown that this method performs very well in multivariate classification problems, often outperforms the PAM method (using the NSC algorithm) and can be as competitive as the support vector machines classifiers. It is also suitable for feature elimination and can be used as a gene selection method. The open source R package for this method (named "rda") is available on CRAN (http://www.r-project.org) for download and testing.
View details for DOI 10.1093/biostatistics/kxj035
View details for Web of Science ID 000242715400006
View details for PubMedID 16603682
-
Are clusters found in one dataset present in another dataset?
BIOSTATISTICS
2007; 8 (1): 9-31
Abstract
In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare the four versions of the validation procedure which all use the IGP, but differ in the way in which the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and one version of the validation procedure is more widely applicable than the other three. An implementation of this algorithm is in a package called "clusterRepro" available through The Comprehensive R Archive Network (http://cran.r-project.org).
View details for DOI 10.1093/biostatistics/kxj029
View details for Web of Science ID 000242715400002
View details for PubMedID 16613834
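The abstract above names the in-group proportion without spelling it out. As a rough illustration only, the Python sketch below takes the IGP of a cluster to be the fraction of its members whose nearest neighbor also belongs to that cluster. The function name `in_group_proportion`, the use of Euclidean distance, and the toy data are assumptions made for this sketch; they are not taken from the paper or from the clusterRepro package.

```python
from math import dist


def in_group_proportion(points, labels, cluster):
    """Fraction of points labelled `cluster` whose nearest neighbor
    (Euclidean distance, an assumption of this sketch) shares that label."""
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    if len(points) < 2:
        raise ValueError("need at least two points to define a nearest neighbor")
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    if not members:
        raise ValueError("no points carry the requested cluster label")

    hits = 0
    for i in members:
        # Nearest neighbor of point i among all other points.
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(points[i], points[j]),
        )
        if labels[nearest] == cluster:
            hits += 1
    return hits / len(members)


# Hypothetical toy data: cluster "a" is tight, cluster "b" has one stray point.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (4, 4)]
toy_labels = ["a", "a", "a", "b", "b", "b"]
igp_a = in_group_proportion(toy_points, toy_labels, "a")
igp_b = in_group_proportion(toy_points, toy_labels, "b")
```

In this toy example the stray point (4, 4) lies closer to cluster "a" than to the rest of "b", so the value for "b" falls below that for "a".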
-
Tumor-infiltrating T cells are not predictive of clinical outcome in follicular lymphoma.
48th Annual Meeting of the American-Society-of-Hematology
AMER SOC HEMATOLOGY. 2006: 247A–248A
View details for Web of Science ID 000242440001084
-
Preliminary report on a phase I/II study of intratumoral injection of PF-3512676 (CpG 7909), a TLR9 agonist, combined with radiation in recurrent low-grade lymphomas.
48th Annual Meeting of the American-Society-of-Hematology
AMER SOC HEMATOLOGY. 2006: 767A–768A
View details for Web of Science ID 000242440003505
-
Distinct patterns of DNA copy number alteration are associated with different clinicopathological features and gene-expression subtypes of breast cancer
GENES CHROMOSOMES & CANCER
2006; 45 (11): 1033-1040
Abstract
Breast cancer is a leading cause of cancer-death among women, where the clinicopathological features of tumors are used to prognosticate and guide therapy. DNA copy number alterations (CNAs), which occur frequently in breast cancer and define key pathogenetic events, are also potentially useful prognostic or predictive factors. Here, we report a genome-wide array-based comparative genomic hybridization (array CGH) survey of CNAs in 89 breast tumors from a patient cohort with locally advanced disease. Statistical analysis links distinct cytoband loci harboring CNAs to specific clinicopathological parameters, including tumor grade, estrogen receptor status, presence of TP53 mutation, and overall survival. Notably, distinct spectra of CNAs also underlie the different subtypes of breast cancer recently defined by expression-profiling, implying these subtypes develop along distinct genetic pathways. In addition, higher numbers of gains/losses are associated with the "basal-like" tumor subtype, while high-level DNA amplification is more frequent in "luminal-B" subtype tumors, suggesting also that distinct mechanisms of genomic instability might underlie their pathogenesis. The identified CNAs may provide a basis for improved patient prognostication, as well as a starting point to define important genes to further our understanding of the pathobiology of breast cancer. This article contains Supplementary Material available at http://www.interscience.wiley.com/jpages/1045-2257/suppmat
View details for DOI 10.1002/gcc.20366
View details for Web of Science ID 000240601400005
View details for PubMedID 16897746
-
Discovery and validation of breast cancer subtypes
BMC GENOMICS
2006; 7
Abstract
Previous studies demonstrated breast cancer tumor tissue samples could be classified into different subtypes based upon DNA microarray profiles. The most recent study presented evidence for the existence of five different subtypes: normal breast-like, basal, luminal A, luminal B, and ERBB2+. Based upon the analysis of 599 microarrays (five separate cDNA microarray datasets) using a novel approach, we present evidence in support of the most consistently identifiable subtypes of breast cancer tumor tissue microarrays being: ESR1+/ERBB2-, ESR1-/ERBB2-, and ERBB2+ (collectively called the ESR1/ERBB2 subtypes). We validate all three subtypes statistically and show the subtype to which a sample belongs is a significant predictor of overall survival and distant-metastasis free probability. As a consequence of the statistical validation procedure we have a set of centroids which can be applied to any microarray (indexed by UniGene Cluster ID) to classify it to one of the ESR1/ERBB2 subtypes. Moreover, the method used to define the ESR1/ERBB2 subtypes is not specific to the disease. The method can be used to identify subtypes in any disease for which there are at least two independent microarray datasets of disease samples.
View details for DOI 10.1186/1471-2164-7-231
View details for Web of Science ID 000240732900001
View details for PubMedID 16965636
View details for PubMedCentralID PMC1574316
-
Global transcriptional response to interferon is a determinant of HCV treatment outcome and is modified by race
HEPATOLOGY
2006; 44 (2): 352-359
Abstract
Interferon (IFN)-alpha-based therapy for chronic hepatitis C is effective in fewer than 50% of all treated patients, with a substantially lower response rate in black patients. The goal of this study was to investigate the underlying host transcriptional response associated with interferon treatment outcomes. We collected peripheral blood mononuclear cells from chronic hepatitis C patients before initiation of IFN-alpha therapy and incubated the cells with or without IFN-alpha for 6 hours, followed by microarray assay to identify IFN-induced gene transcription. The microarray datasets were analyzed statistically according to the patients' race and virological responses to subsequent IFN-alpha treatment. The global induction of IFN-stimulated genes (ISGs) was significantly greater in sustained virological responders compared with nonresponders and in white patients compared with black patients. In addition, a significantly greater global induction of ISGs was observed in sustained virological responders compared with nonresponders within the group of white patients. The level of IFN-induced signal transducer and activator of transcription (STAT) 1 activation, a key component of the Janus kinase (JAK)-STAT signaling pathway, correlated with the global induction of ISGs and was significantly higher in white patients than in black patients. In conclusion, both treatment outcome and race are associated with different transcriptional responses to IFN-alpha. Because this difference is evident in the global induction of ISGs rather than a selective effect on a subset of such genes, key factors affecting the outcome of IFN-alpha therapy are likely to act at the JAK-STAT pathway that controls transcription of downstream ISGs.
View details for DOI 10.1002/hep.21267
View details for PubMedID 16871572
-
Sparse principal component analysis
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2006; 15 (2): 265-286
View details for DOI 10.1198/106186006X113430
View details for Web of Science ID 000238044400001
-
A tail strength measure for assessing the overall univariate significance in a dataset
BIOSTATISTICS
2006; 7 (2): 167-181
Abstract
We propose an overall measure of significance for a set of hypothesis tests. The 'tail strength' is a simple function of the p-values computed for each of the tests. This measure is useful, for example, in assessing the overall univariate strength of a large set of features in microarray and other genomic and biomedical studies. It also has a simple relationship to the false discovery rate of the collection of tests. We derive the asymptotic distribution of the tail strength measure, and illustrate its use on a number of real datasets.
View details for DOI 10.1093/biostatistics/kxj009
View details for Web of Science ID 000236436300001
View details for PubMedID 16332926
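The abstract describes the tail strength only as "a simple function of the p-values". One such function, assumed here for illustration rather than quoted from the abstract, averages 1 - p_(k) (m + 1) / k over the m ordered p-values p_(1) <= ... <= p_(m). Under a uniform null each ordered p-value has expectation k / (m + 1), so the average is near zero, and it grows toward 1 as small p-values accumulate. The function name `tail_strength` is hypothetical.

```python
def tail_strength(p_values):
    """Average of 1 - p_(k) * (m + 1) / k over the sorted p-values.

    This formula is an assumption of the sketch, not text from the abstract.
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    if any(p < 0 or p > 1 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    ordered = sorted(p_values)
    # k runs from 1 to m over the ordered p-values.
    return sum(1 - p * (m + 1) / k for k, p in enumerate(ordered, start=1)) / m


# p-values sitting exactly at their null expectations k / (m + 1).
ts_null_like = tail_strength([0.2, 0.4, 0.6, 0.8])
# Hypothetical set with two very small p-values.
ts_signal = tail_strength([0.001, 0.002, 0.5])
```

Values well above zero suggest more small p-values than a uniform null would produce; how large is "well above" depends on the sampling distribution, which this sketch does not compute.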
-
Hybrid hierarchical clustering with applications to microarray data
BIOSTATISTICS
2006; 7 (2): 286-301
Abstract
In this paper, we propose a hybrid clustering method that combines the strengths of bottom-up hierarchical clustering with those of top-down clustering. The first method is good at identifying small clusters but not large ones; the strengths are reversed for the second method. The hybrid method is built on the new idea of a mutual cluster: a group of points closer to each other than to any other points. Theoretical connections between mutual clusters and bottom-up clustering methods are established, aiding in their interpretation and providing an algorithm for identification of mutual clusters. We illustrate the technique on simulated and real microarray datasets.
View details for DOI 10.1093/biostatistics/kxj007
View details for Web of Science ID 000236436300009
View details for PubMedID 16301308
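The mutual-cluster definition in the abstract ("a group of points closer to each other than to any other points") can be checked directly on small data. The Python sketch below reads it as: the largest distance within the group is smaller than the smallest distance from any group member to any point outside it. That reading, the Euclidean metric, and the name `is_mutual_cluster` are assumptions of this sketch; it is not the identification algorithm from the paper.

```python
from itertools import combinations
from math import dist


def is_mutual_cluster(points, members):
    """True if the largest within-group distance is smaller than the
    smallest distance from a group member to any outside point."""
    members = set(members)
    outside = [i for i in range(len(points)) if i not in members]
    if len(members) < 2 or not outside:
        raise ValueError("need at least two members and one outside point")

    # Largest pairwise distance inside the candidate group.
    diameter = max(dist(points[i], points[j]) for i, j in combinations(members, 2))
    # Smallest distance from any member to any non-member.
    gap = min(dist(points[i], points[k]) for i in members for k in outside)
    return diameter < gap


# Hypothetical toy data: two tight pairs and one distant point.
toy = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
pair_is_mutual = is_mutual_cluster(toy, [0, 1])
mixed_is_mutual = is_mutual_cluster(toy, [0, 1, 2])
```

The pair (0, 0), (0, 1) passes because its members are 1 apart and more than 6 from everything else; adding (5, 5) fails because that point is closer to (5, 6) than to the rest of the group.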
-
A simple method for assessing sample sizes in microarray experiments
BMC BIOINFORMATICS
2006; 7
Abstract
In this short article, we discuss a simple method for assessing sample size requirements in microarray experiments. Our method starts with the output from a permutation-based analysis for a set of pilot data, e.g. from the SAM package. Then for a given hypothesized mean difference and various sample sizes, we estimate the false discovery rate and false negative rate of a list of genes; these are also interpretable as per gene power and type I error. We also discuss application of our method to other kinds of response variables, for example survival outcomes. Our method seems to be useful for sample size assessment in microarray experiments.
View details for DOI 10.1186/1471-2105-7-106
View details for Web of Science ID 000237138600001
View details for PubMedID 16512900
View details for PubMedCentralID PMC1450307
-
An evaluation of tumor oxygenation and gene expression in patients with early stage non-small cell lung cancers
CLINICAL CANCER RESEARCH
2006; 12 (5): 1507-1514
Abstract
To directly assess tumor oxygenation in resectable non-small cell lung cancers (NSCLC) and to correlate tumor pO2 and the selected gene and protein expression to treatment outcomes. Twenty patients with resectable NSCLC were enrolled. Intraoperative measurements of normal lung and tumor pO2 were done with the Eppendorf polarographic electrode. All patients had plasma osteopontin measurements by ELISA. Carbonic anhydrase-IX (CA IX) staining of tumor sections was done in the majority of patients (n = 16), as was gene expression profiling (n = 12) using cDNA microarrays. Tumor pO2 was correlated with CA IX staining, osteopontin levels, and treatment outcomes. The median tumor pO2 ranged from 0.7 to 46 mm Hg (median, 16.6) and was lower than normal lung pO2 in all but one patient. Because both variables were affected by the completeness of lung deflation during measurement, we used the ratio of tumor/normal lung (T/L) pO2 as a reflection of tumor oxygenation. The median T/L pO2 was 0.13. T/L pO2 correlated significantly with plasma osteopontin levels (r = 0.53, P = 0.02) and CA IX expression (P = 0.006). Gene expression profiling showed that high CD44 expression was a predictor for relapse, which was confirmed by tissue staining of CD44 variant 6 protein. Other variables associated with the risk of relapse were T stage (P = 0.02), T/L pO2 (P = 0.04), and osteopontin levels (P = 0.001). Tumor hypoxia exists in resectable NSCLC and is associated with elevated expression of osteopontin and CA IX. Tumor hypoxia and elevated osteopontin levels and CD44 expression correlated with poor prognosis. A larger study is needed to confirm the prognostic significance of these factors.
View details for DOI 10.1158/1078-0432.CCR-05-2049
View details for PubMedID 16533775
-
Prediction by supervised principal components
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2006; 101 (473): 119-137
View details for DOI 10.1198/016214505000000628
View details for Web of Science ID 000235958400016
-
Changes of gene expression in gastric preneoplasia following Helicobacter pylori eradication therapy
CANCER EPIDEMIOLOGY BIOMARKERS & PREVENTION
2006; 15 (2): 272-280
Abstract
Helicobacter pylori causes gastric preneoplasia and neoplasia. Eradicating H. pylori can result in partial regression of preneoplastic lesions; however, the molecular underpinning of this change is unknown. To identify molecular changes in the gastric mucosa following H. pylori eradication, we used cDNA microarrays (with each array containing approximately 30,300 genes) to analyze 54 gastric biopsies from a randomized, placebo-controlled trial of H. pylori therapy. The 54 biopsies were obtained from 27 subjects (13 from the treatment and 14 from the placebo group) with chronic gastritis, atrophy, and/or intestinal metaplasia. Each subject contributed one biopsy before and another biopsy 1 year after the intervention. Significance analysis of microarrays (SAM) was used to compare the gene expression profiles of pre-intervention and post-intervention biopsies. In the treatment group, SAM identified 30 genes whose expression changed significantly from baseline to 1 year after treatment (0 up-regulated and 30 down-regulated). In the placebo group, the expression of 55 genes differed significantly over the 1-year period (32 up-regulated and 23 down-regulated). Five genes involved in cell-cell adhesion and lining (TACSTD1 and MUC13), cell cycle differentiation (S100A10), and lipid metabolism and transport (FABP1 and MTP) were down-regulated over time in the treatment group but up-regulated in the placebo group. Immunohistochemistry for one of these differentially expressed genes (FABP1) confirmed the changes in gene expression observed by microarray. In conclusion, H. pylori eradication may stop or reverse ongoing molecular processes in the stomach. Further studies are needed to evaluate the use of these genes as markers for gastric cancer risk.
View details for DOI 10.1158/1055-9965.EPI-05-0362
View details for Web of Science ID 000235587200012
View details for PubMedID 16492915
-
Combined microarray analysis of small cell lung cancer reveals altered apoptotic balance and distinct expression signatures of MYC family gene amplification
ONCOGENE
2006; 25 (1): 130-138
Abstract
DNA amplifications and deletions frequently contribute to the development and progression of lung cancer. To identify such novel alterations in small cell lung cancer (SCLC), we performed comparative genomic hybridization on a set of 24 SCLC cell lines, using cDNA microarrays representing approximately 22,000 human genes (providing an average mapping resolution of <70 kb). We identified localized DNA amplifications corresponding to oncogenes known to be amplified in SCLC, including MYC (8q24), MYCN (2p24) and MYCL1 (1p34). Additional highly localized DNA amplifications suggested candidate oncogenes not previously identified as amplified in SCLC, including the antiapoptotic genes TNFRSF4 (1p36), DAD1 (14q11), BCL2L1 (20q11) and BCL2L2 (14q11). Likewise, newly discovered PCR-validated homozygous deletions suggested candidate tumor-suppressor genes, including the proapoptotic genes MAPK10 (4q21) and TNFRSF6 (10q23). To characterize the effect of DNA amplification on gene expression patterns, we performed expression profiling using the same microarray platform. Among our findings, we identified sets of genes whose expression correlated with MYC, MYCN or MYCL1 amplification, with surprisingly little overlap among gene sets. While both MYC and MYCN amplification were associated with increased and decreased expression of known MYC upregulated and downregulated targets, respectively, MYCL1 amplification was associated only with the latter. Our findings support a role of altered apoptotic balance in the pathogenesis of SCLC, and suggest that MYC family genes might affect oncogenesis through distinct sets of targets, in particular implicating the importance of transcriptional repression.
View details for DOI 10.1038/sj.onc.1208997
View details for Web of Science ID 000234406400014
View details for PubMedID 16116477
-
Gene expression profiling predicts survival in conventional renal cell carcinoma
PLOS MEDICINE
2006; 3 (1): 115-124
Abstract
Conventional renal cell carcinoma (cRCC) accounts for most of the deaths due to kidney cancer. Tumor stage, grade, and patient performance status are used currently to predict survival after surgery. Our goal was to identify gene expression features, using comprehensive gene expression profiling, that correlate with survival. Gene expression profiles were determined in 177 primary cRCCs using DNA microarrays. Unsupervised hierarchical clustering analysis segregated cRCC into five gene expression subgroups. Expression subgroup was correlated with survival in long-term follow-up and was independent of grade, stage, and performance status. The tumors were then divided evenly into training and test sets that were balanced for grade, stage, performance status, and length of follow-up. A semisupervised learning algorithm (supervised principal components analysis) was applied to identify transcripts whose expression was associated with survival in the training set, and the performance of this gene expression-based survival predictor was assessed using the test set. With this method, we identified 259 genes that accurately predicted disease-specific survival among patients in the independent validation group (p < 0.001). In multivariate analysis, the gene expression predictor was a strong predictor of survival independent of tumor stage, grade, and performance status (p < 0.001). cRCC displays molecular heterogeneity and can be separated into gene expression subgroups that correlate with survival after surgery. We have identified a set of 259 genes that predict survival after surgery independent of clinical prognostic factors.
View details for DOI 10.1371/journal.pmed.0030013
View details for Web of Science ID 000236342700020
View details for PubMedID 16318415
View details for PubMedCentralID PMC1298943
-
Autoantibody profiling of lupus mice deficient for interferon signaling components.
6th Annual Meeting of the Federation-of-Clinical-Immunology-Societies
ACADEMIC PRESS INC ELSEVIER SCIENCE. 2006: S72–S73
View details for DOI 10.1016/j.clim.2006.04.102
View details for Web of Science ID 000237924300184
-
Gene expression profiling differentiates germ cell tumors from other cancers and defines subtype-specific signatures
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2005; 102 (49): 17763-17768
Abstract
Germ cell tumors (GCTs) of the testis are the predominant cancer among young men. We analyzed gene expression profiles of 50 GCTs of various subtypes, and we compared them with 443 other common malignant tumors of epithelial, mesenchymal, and lymphoid origins. Significant differences in gene expression were found among major histological subtypes of GCTs, and between them and other malignancies. We identified 511 genes, belonging to several critical functional groups such as cell cycle progression, cell proliferation, and apoptosis, to be significantly differentially expressed in GCTs compared with other tumor types. Sixty-five genes were sufficient for the construction of a GCT class predictor of high predictive accuracy (100% training set, 96% test set), which might be useful in the diagnosis of tumors of unknown primary origin. Previously described diagnostic and prognostic markers were found to be expressed by the appropriate GCT subtype (AFP, POU5F1, POV1, CCND2, and KIT). Several additional differentially expressed genes were identified in teratomas (EGR1 and MMP7), yolk sac tumors (PTPN13 and FN1), and seminomas (NR6A1, DPPA4, and IRX1). Dynamic computation of interaction networks and mapping to existing pathways knowledge databases revealed a potential role of EGR1 in p21-induced cell cycle arrest and intrinsic chemotherapy resistance of mature teratomas.
View details for DOI 10.1073/pnas.0509082102
View details for PubMedID 16306258
-
Differential gene expression profiles in CD34+ myelodysplastic syndrome marrow cells.
47th Annual Meeting of the American-Society-of-Hematology
AMER SOC HEMATOLOGY. 2005: 956A–956A
View details for Web of Science ID 000233426006208
-
Gene expression profiling and FLT3 status correlate with outcome in de novo acute myeloid leukemia (AML) with normal karyotype: Results of Children's Oncology Group (COG) study POG #9421.
47th Annual Meeting of the American-Society-of-Hematology
AMER SOC HEMATOLOGY. 2005: 667A–667A
View details for Web of Science ID 000233426004219
-
Cluster validation by prediction strength
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2005; 14 (3): 511-528
View details for DOI 10.1198/106186005X59243
View details for Web of Science ID 000235042000001
-
Signature patterns of gene expression in mouse atherosclerosis and their correlation to human coronary disease
PHYSIOLOGICAL GENOMICS
2005; 22 (2): 213-226
Abstract
The propensity for developing atherosclerosis is dependent on underlying genetic risk and varies as a function of age and exposure to environmental risk factors. Employing three mouse models with different disease susceptibility, two diets, and a longitudinal experimental design, it was possible to manipulate each of these factors to focus analysis on genes most likely to have a specific disease-related function. To identify differences in longitudinal gene expression patterns of atherosclerosis, we have developed and employed a statistical algorithm that relies on generalized regression and permutation analysis. Comprehensive annotation of the array with ontology and pathway terms has allowed rigorous identification of molecular and biological processes that underlie disease pathophysiology. The repertoire of atherosclerosis-related immunomodulatory genes has been extended, and additional fundamental pathways have been identified. This highly disease-specific group of mouse genes was combined with an extensive human coronary artery data set to identify a shared group of genes differentially regulated among atherosclerotic tissues from different species and different vascular beds. A small core subset of these differentially regulated genes was sufficient to accurately classify various stages of the disease in mouse. The same gene subset was also found to accurately classify human coronary lesion severity. In addition, this classifier gene set was able to distinguish with high accuracy atherectomy specimens from native coronary artery disease vs. those collected from in-stent restenosis lesions, thus identifying molecular differences between these two processes. These studies significantly focus efforts aimed at identifying central gene regulatory pathways that mediate atherosclerotic disease, and the identification of classification gene sets offers unique insights into potential diagnostic and therapeutic strategies in atherosclerotic disease.
View details for DOI 10.1152/physiolgenomics.00001.2005
View details for Web of Science ID 000230987900011
View details for PubMedID 15870398
-
Array-based comparative genomic hybridization identifies localized DNA amplifications and homozygous deletions in pancreatic cancer
NEOPLASIA
2005; 7 (6): 556-562
Abstract
Pancreatic cancer, the fourth leading cause of cancer death in the United States, is frequently associated with the amplification and deletion of specific oncogenes and tumor-suppressor genes (TSGs), respectively. To identify such novel alterations and to discover the underlying genes, we performed comparative genomic hybridization on a set of 22 human pancreatic cancer cell lines, using cDNA microarrays measuring approximately 26,000 human genes (thereby providing an average mapping resolution of <60 kb). To define the subset of amplified and deleted genes with correspondingly altered expression, we also profiled mRNA levels in parallel using the same cDNA microarray platform. In total, we identified 14 high-level amplifications (38-4934 kb in size) and 15 homozygous deletions (46-725 kb). We discovered novel localized amplicons, suggesting previously unrecognized candidate oncogenes at 6p21, 7q21 (SMURF1, TRRAP), 11q22 (BIRC2, BIRC3), 12p12, 14q24 (TGFB3), 17q12, and 19q13. Likewise, we identified novel polymerase chain reaction-validated homozygous deletions indicating new candidate TSGs at 6q25, 8p23, 8p22 (TUSC3), 9q33 (TNC, TNFSF15), 10q22, 10q24 (CHUK), 11p15 (DKK3), 16q23, 18q23, 21q22 (PRDM15, ANKRD3), and Xp11. Our findings suggest candidate genes and pathways, which may contribute to the development or progression of pancreatic cancer.
View details for DOI 10.1593/neo.04586
View details for Web of Science ID 000230209600002
View details for PubMedID 16036106
View details for PubMedCentralID PMC1501288
-
Genome-wide characterization of gene expression variations and DNA copy number changes in prostate cancer cell lines
PROSTATE
2005; 63 (2): 187-197
Abstract
The aim of this study was to characterize gene expression and DNA copy number profiles in androgen sensitive (AS) and androgen insensitive (AI) prostate cancer cell lines on a genome-wide scale. Gene expression profiles and DNA copy number changes were examined using DNA microarrays in eight commonly used prostate cancer cell lines. Chromosomal regions with DNA copy number changes were identified using cluster along chromosome (CLAC). There were discrete differences in gene expression patterns between AS and AI cells that were not limited to androgen-responsive genes. AI cells displayed more DNA copy number changes, especially amplifications, than AS cells. The gene expression profiles of cell lines showed limited similarities to prostate tumors harvested at surgery. AS and AI cell lines are different in their transcriptional programs and degree of DNA copy number alterations. This dataset provides a context for the use of prostate cancer cell lines as models for clinical cancers.
View details for DOI 10.1002/pros.20158
View details for PubMedID 15486987
-
Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2005; 102 (10): 3738-3743
Abstract
Based on the hypothesis that features of the molecular program of normal wound healing might play an important role in cancer metastasis, we previously identified consistent features in the transcriptional response of normal fibroblasts to serum, and used this "wound-response signature" to reveal links between wound healing and cancer progression in a variety of common epithelial tumors. Here, in a consecutive series of 295 early breast cancer patients, we show that both overall survival and distant metastasis-free survival are markedly diminished in patients whose tumors expressed this wound-response signature compared to tumors that did not express this signature. A gene expression centroid of the wound-response signature provides a basis for prospectively assigning a prognostic score that can be scaled to suit different clinical purposes. The wound-response signature improves risk stratification independently of known clinico-pathologic risk factors and previously established prognostic signatures based on unsupervised hierarchical clustering ("molecular subtypes") or supervised predictors of metastasis ("70-gene prognosis signature").
View details for DOI 10.1073/pnas.0409462102
View details for PubMedID 15701700
-
Mouse strain-specific differences in vascular wall gene expression and their relationship to vascular disease
ARTERIOSCLEROSIS THROMBOSIS AND VASCULAR BIOLOGY
2005; 25 (2): 302-308
Abstract
Different strains of inbred mice exhibit different susceptibility to the development of atherosclerosis. The C3H/HeJ and C57Bl/6 mice have been used in several studies aimed at understanding the genetic basis of atherosclerosis. Under controlled environmental conditions, variations in susceptibility to atherosclerosis reflect differences in genetic makeup, and these differences must be reflected in gene expression patterns that are temporally related to the development of disease. In this study, we sought to identify the genetic pathways that are differentially activated in the aortas of these mice. We performed genome-wide transcriptional profiling of aortas from C3H/HeJ and C57Bl/6 mice. Differences in gene expression were identified at baseline as well as during normal aging and longitudinal exposure to high-fat diet. The significance of these genes to the development of atherosclerosis was evaluated by observing their temporal pattern of expression in the well-studied apolipoprotein E model of atherosclerosis. Gene expression differences between the 2 strains suggest that aortas of C57Bl/6 mice have a higher genetic propensity to develop inflammation in response to appropriate atherogenic stimuli. This study expands the repertoire of factors in known disease-related signaling pathways and identifies novel candidate genes for future study. To gain insights into the molecular pathways that are differentially activated in strains of mice with varied susceptibility to atherosclerosis, we performed comprehensive transcriptional profiling of their vascular wall. Genes identified through these studies expand the repertoire of factors in disease-related signaling pathways and identify novel candidate genes in atherosclerosis.
View details for DOI 10.1161/01.ATV.0000151372.86863.a5
View details for Web of Science ID 000226594000009
View details for PubMedID 15550693
-
A method for calling gains and losses in array CGH data
BIOSTATISTICS
2005; 6 (1): 45-58
Abstract
Array CGH is a powerful technique for genomic studies of cancer. It enables one to carry out genome-wide screening for regions of genetic alterations, such as chromosome gains and losses, or localized amplifications and deletions. In this paper, we propose a new algorithm 'Cluster along chromosomes' (CLAC) for the analysis of array CGH data. CLAC builds hierarchical clustering-style trees along each chromosome arm (or chromosome), and then selects the 'interesting' clusters by controlling the False Discovery Rate (FDR) at a certain level. In addition, it provides a consensus summary across a set of arrays, as well as an estimate of the corresponding FDR. We illustrate the method using an application of CLAC on a lung cancer microarray CGH data set as well as a BAC array CGH data set of aneuploid cell strains.
View details for DOI 10.1093/biostatistics/kxh017
View details for Web of Science ID 000226346300005
View details for PubMedID 15618527
-
Early detection of breast cancer based on gene-expression patterns in peripheral blood cells
BREAST CANCER RESEARCH
2005; 7 (5): R634-R644
Abstract
Existing methods to detect breast cancer in asymptomatic patients have limitations, and there is a need to develop more accurate and convenient methods. In this study, we investigated whether early detection of breast cancer is possible by analyzing gene-expression patterns in peripheral blood cells. Using macroarrays and nearest-shrunken-centroid method, we analyzed the expression pattern of 1,368 genes in peripheral blood cells of 24 women with breast cancer and 32 women with no signs of this disease. The results were validated using a standard leave-one-out cross-validation approach. We identified a set of 37 genes that correctly predicted the diagnostic class in at least 82% of the samples. The majority of these genes had a decreased expression in samples from breast cancer patients, and predominantly encoded proteins implicated in ribosome production and translation control. In contrast, the expression of some defense-related genes was increased in samples from breast cancer patients. The results show that a blood-based gene-expression test can be developed to detect breast cancer early in asymptomatic patients. Additional studies with a large sample size, from women both with and without the disease, are warranted to confirm or refute this finding.
View details for DOI 10.1186/bcr1203
View details for Web of Science ID 000232332200021
View details for PubMedID 16168108
View details for PubMedCentralID PMC1242124
-
The 'miss rate' for the analysis of gene expression data
BIOSTATISTICS
2005; 6 (1): 111-117
Abstract
Multiple testing issues are important in gene expression studies, where typically thousands of genes are compared over two or more experimental conditions. The false discovery rate has become a popular measure in this setting. Here we discuss a complementary measure, the 'miss rate', and show how to estimate it in practice.
View details for DOI 10.1093/biostatistics/kxh021
View details for Web of Science ID 000226346300009
View details for PubMedID 15618531
-
Sparsity and smoothness via the fused lasso
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2005; 67: 91-108
View details for Web of Science ID 000225686900006
-
CSF1 expression signature identifies a subset of breast carcinomas and influences outcome.
28th Annual San Antonio Breast Cancer Symposium
SPRINGER. 2005: S135–S135
View details for Web of Science ID 000233407100364
-
Sample classification from protein mass spectrometry, by 'peak probability contrasts'
BIOINFORMATICS
2004; 20 (17): 3034-3044
Abstract
Early cancer detection has always been a major research focus in solid tumor oncology. Early tumor detection can theoretically result in lower stage tumors, more treatable diseases and ultimately higher cure rates with less treatment-related morbidities. Protein mass spectrometry is a potentially powerful tool for early cancer detection. We propose a novel method for sample classification from protein mass spectrometry data. When applied to spectra from both diseased and healthy patients, the 'peak probability contrast' technique provides a list of all common peaks among the spectra, their statistical significance and their relative importance in discriminating between the two groups. We illustrate the method on matrix-assisted laser desorption and ionization mass spectrometry data from a study of ovarian cancers. Compared to other statistical approaches for class prediction, the peak probability contrast method performs as well or better than several methods that require the full spectra, rather than just labelled peaks. It is also much more interpretable biologically. The peak probability contrast method is a potentially useful tool for sample classification from protein mass spectrometry data.
View details for DOI 10.1093/bioinformatics/bth357
View details for Web of Science ID 000225361400017
View details for PubMedID 15226172
-
The percentage of tumor-infiltrating T cells is not correlated with overall survival in follicular B-cell lymphomas
46th Annual Meeting of the American-Society-of-Hematology
AMER SOC HEMATOLOGY. 2004: 891A–891A
View details for Web of Science ID 000225127503264
-
Gene expression profiles at diagnosis in de novo childhood AML patients identify FLT3 mutations with good clinical outcomes
BLOOD
2004; 104 (9): 2646-2654
Abstract
Fms-like tyrosine kinase 3 (FLT3) mutations are associated with unfavorable outcomes in children with acute myeloid leukemia (AML). We used DNA microarrays to identify gene expression profiles related to FLT3 status and outcome in childhood AML. Among 81 diagnostic specimens, 36 had FLT3 mutations (FLT3-MUs), 24 with internal tandem duplications (ITDs) and 12 with activating loop mutations (ALMs). In addition, 8 of 19 specimens from patients with relapses had FLT3-MUs. Predictive analysis of microarrays (PAM) identified genes that differentiated FLT3-ITD from FLT3-ALM and FLT3 wild-type (FLT3-WT) cases. Among the 42 specimens with FLT3-MUs, PAM identified 128 genes that correlated with clinical outcome. Event-free survival (EFS) in FLT3-MU patients with a favorable signature was 45% versus 5% for those with an unfavorable signature (P = .018). Among FLT3-MU specimens, high expression of the RUNX3 gene and low expression of the ATRX gene were associated with inferior outcome. The ratio of RUNX3 to ATRX expression was used to classify FLT3-MU cases into 3 EFS groups: 70%, 37%, and 0% for low, intermediate, and high ratios, respectively (P < .0001). Thus, gene expression profiling identified AML patients with divergent prognoses within the FLT3-MU group, and the RUNX3 to ATRX expression ratio should be a useful prognostic indicator in these patients.
View details for DOI 10.1182/blood-2003-12-4449
View details for PubMedID 15251987
-
The entire regularization path for the support vector machine
JOURNAL OF MACHINE LEARNING RESEARCH
2004; 5: 1391-1415
View details for Web of Science ID 000236328300007
-
Developmental response to hypoxia
FASEB JOURNAL
2004; 18 (12): 1348-1365
Abstract
Molecular mechanisms underlying fetal growth restriction due to placental insufficiency and in utero hypoxia are not well understood. In the current study, time-dependent (3 h-11 days) changes in fetal tissue gene expression in a rat model of in utero hypoxia compared with normoxic controls were investigated as an initial approach to understand molecular events underlying fetal development in response to hypoxia. Under hypoxic conditions, litter size was reduced and IGFBP-1 was up-regulated in maternal serum and in fetal liver and heart. Tissue-specific, distinct regulatory patterns of gene expression were observed under acute vs. chronic hypoxic conditions. Induction of glycolytic enzymes was an early event in response to hypoxia during organ development; consistently, tissue-specific induction of calcium homeostasis-related genes and suppression of growth-related genes were observed, suggesting mechanisms underlying hypoxia-related fetal growth restriction. Furthermore, induction of inflammation-related genes in placentas exposed to long-term hypoxia (11 days) suggests a mechanism for placental dysfunction and impaired pregnancy outcome accompanying in utero hypoxia.
View details for DOI 10.1096/fj.03-1377com
View details for Web of Science ID 000224243200054
View details for PubMedID 15333578
-
Toxicity from radiation therapy associated with abnormal transcriptional responses to DNA damage
Annual Scientific Meeting on Exploring Genomics in Radiation Oncology
ELSEVIER IRELAND LTD. 2004: S29–S29
View details for Web of Science ID 000225708500095
-
The use of plasma surface-enhanced laser desorption/ionization time-of-flight mass spectrometry proteomic patterns for detection of head and neck squamous cell cancers
45th Annual Meeting of the American-Society-for-Therapeutic-Radiology-and-Oncology (ASTRO)
AMER ASSOC CANCER RESEARCH. 2004: 4806–12
Abstract
Our study was undertaken to determine the utility of plasma proteomic profiling using surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) mass spectrometry for the detection of head and neck squamous cell carcinomas (HNSCCs). Pretreatment plasma samples from HNSCC patients or controls without known neoplastic disease were analyzed on the Protein Biology System IIc SELDI-TOF mass spectrometer (Ciphergen Biosystems, Fremont, CA). Proteomic spectra of mass:charge ratio (m/z) were generated by the application of plasma to immobilized metal-affinity-capture (IMAC) ProteinChip arrays activated with copper. A total of 37356 data points were generated for each sample. A training set of spectra from 56 cancer patients and 52 controls were applied to the "Lasso" technique to identify protein profiles that can distinguish cancer from noncancer, and cross-validation was used to determine test errors in this training set. The discovery pattern was then used to classify a separate masked test set of 57 cancer and 52 controls. In total, we analyzed the proteomic spectra of 113 cancer patients and 104 controls. The Lasso approach identified 65 significant data points for the discrimination of normal from cancer profiles. The discriminatory pattern correctly identified 39 of 57 HNSCC patients and 40 of 52 noncancer controls in the masked test set. These results yielded a sensitivity of 68% and specificity of 73%. Subgroup analyses in the test set of four different demographic factors (age, gender, and cigarette and alcohol use) that can potentially confound the interpretation of the results suggest that this model tended to overpredict cancer in control smokers. Plasma proteomic profiling with SELDI-TOF mass spectrometry provides moderate sensitivity and specificity in discriminating HNSCC. Further improvement and validation of this approach is needed to determine its usefulness in screening for this disease.
View details for Web of Science ID 000222840700027
View details for PubMedID 15269156
-
Efficient quadratic regularization for expression arrays
BIOSTATISTICS
2004; 5 (3): 329-340
Abstract
Gene expression arrays typically have 50 to 100 samples and 1000 to 20,000 variables (genes). There have been many attempts to adapt statistical models for regression and classification to these data, and in many cases these attempts have challenged the computational resources. In this article we expose a class of techniques based on quadratic regularization of linear models, including regularized (ridge) regression, logistic and multinomial regression, linear and mixture discriminant analysis, the Cox model and neural networks. For all of these models, we show that dramatic computational savings are possible over naive implementations, using standard transformations in numerical linear algebra.
View details for DOI 10.1093/biostatistics/kxh010
View details for Web of Science ID 000222723600001
View details for PubMedID 15208198
-
Different gene expression patterns in invasive lobular and ductal carcinomas of the breast
MOLECULAR BIOLOGY OF THE CELL
2004; 15 (6): 2523-2536
Abstract
Invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) are the two major histological types of breast cancer worldwide. Whereas IDC incidence has remained stable, ILC is the most rapidly increasing breast cancer phenotype in the United States and Western Europe. It is not clear whether IDC and ILC represent molecularly distinct entities and what genes might be involved in the development of these two phenotypes. We conducted comprehensive gene expression profiling studies to address these questions. Total RNA from 21 ILCs, 38 IDCs, two lymph node metastases, and three normal tissues were amplified and hybridized to approximately 42,000 clone cDNA microarrays. Data were analyzed using hierarchical clustering algorithms and statistical analyses that identify differentially expressed genes (significance analysis of microarrays) and minimal subsets of genes (prediction analysis for microarrays) that succinctly distinguish ILCs and IDCs. Eleven of 21 (52%) of the ILCs ("typical" ILCs) clustered together and displayed different gene expression profiles from IDCs, whereas the other ILCs ("ductal-like" ILCs) were distributed between different IDC subtypes. Many of the differentially expressed genes between ILCs and IDCs code for proteins involved in cell adhesion/motility, lipid/fatty acid transport and metabolism, immune/defense response, and electron transport. Many genes that distinguish typical and ductal-like ILCs are involved in regulation of cell growth and immune response. Our data strongly suggest that over half the ILCs differ from IDCs not only in histological and clinical features but also in global transcription programs. The remaining ILCs closely resemble IDCs in their transcription patterns. Further studies are needed to explore the differences between ILC molecular subtypes and to determine whether they require different therapeutic strategies.
View details for DOI 10.1091/mbc.E03-11-0786
View details for Web of Science ID 000221778300001
View details for PubMedID 15034139
View details for PubMedCentralID PMC420079
-
Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes
NEW ENGLAND JOURNAL OF MEDICINE
2004; 350 (18): 1828-1837
Abstract
Several gene-expression signatures can be used to predict the prognosis in diffuse large-B-cell lymphoma, but the lack of practical tests for a genome-scale analysis has restricted the use of this method. We studied 36 genes whose expression had been reported to predict survival in diffuse large-B-cell lymphoma. We measured the expression of each of these genes in independent samples of lymphoma from 66 patients by quantitative real-time polymerase-chain-reaction analyses and related the results to overall survival. In a univariate analysis, genes were ranked on the basis of their ability to predict survival. The genes that were the strongest predictors were LMO2, BCL6, FN1, CCND2, SCYA3, and BCL2. We developed a multivariate model that was based on the expression of these six genes, and we validated the model in two independent microarray data sets. The model was independent of the International Prognostic Index and added to its predictive power. Measurement of the expression of six genes is sufficient to predict overall survival in diffuse large-B-cell lymphoma.
View details for Web of Science ID 000221080300006
View details for PubMedID 15115829
-
Toxicity from radiation therapy associated with abnormal transcriptional responses to DNA damage
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2004; 101 (17): 6635-6640
Abstract
Toxicity from radiation therapy is a grave problem for cancer patients. We hypothesized that some cases of toxicity are associated with abnormal transcriptional responses to radiation. We used microarrays to measure responses to ionizing and UV radiation in lymphoblastoid cells derived from 14 patients with acute radiation toxicity. The analysis used heterogeneity-associated transformation of the data to account for a clinical outcome arising from more than one underlying cause. To compute the risk of toxicity for each patient, we applied nearest shrunken centroids, a method that identifies and cross-validates predictive genes. Transcriptional responses in 24 genes predicted radiation toxicity in 9 of 14 patients with no false positives among 43 controls (P = 2.2 × 10⁻⁷). The responses of these nine patients displayed significant heterogeneity. Of the five patients with toxicity and normal responses, two were treated with protocols that proved to be highly toxic. These results may enable physicians to predict toxicity and tailor treatment for individual patients.
View details for DOI 10.1073/pnas.0307761101
View details for PubMedID 15096622
-
Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia
NEW ENGLAND JOURNAL OF MEDICINE
2004; 350 (16): 1605-1616
Abstract
In patients with acute myeloid leukemia (AML), the presence or absence of recurrent cytogenetic aberrations is used to identify the appropriate therapy. However, the current classification system does not fully reflect the molecular heterogeneity of the disease, and treatment stratification is difficult, especially for patients with intermediate-risk AML with a normal karyotype. We used complementary-DNA microarrays to determine the levels of gene expression in peripheral-blood samples or bone marrow samples from 116 adults with AML (including 45 with a normal karyotype). We used unsupervised hierarchical clustering analysis to identify molecular subgroups with distinct gene-expression signatures. Using a training set of samples from 59 patients, we applied a novel supervised learning algorithm to devise a gene-expression-based clinical-outcome predictor, which we then tested using an independent validation group comprising the 57 remaining patients. Unsupervised analysis identified new molecular subtypes of AML, including two prognostically relevant subgroups in AML with a normal karyotype. Using the supervised learning algorithm, we constructed an optimal 133-gene clinical-outcome predictor, which accurately predicted overall survival among patients in the independent validation group (P=0.006), including the subgroup of patients with AML with a normal karyotype (P=0.046). In multivariate analysis, the gene-expression predictor was a strong independent prognostic factor (odds ratio, 8.8; 95 percent confidence interval, 2.6 to 29.3; P<0.001). The use of gene-expression profiling improves the molecular classification of adult AML.
View details for Web of Science ID 000220819800005
View details for PubMedID 15084693
-
Least angle regression
ANNALS OF STATISTICS
2004; 32 (2): 407-451
View details for Web of Science ID 000221411000001
-
Semi-supervised methods to predict patient survival from gene expression data
PLOS BIOLOGY
2004; 2 (4): 511-522
View details for DOI 10.1371/journal.pbio.0020108
View details for Web of Science ID 000221194700018
-
Semi-supervised methods to predict patient survival from gene expression data.
PLoS biology
2004; 2 (4): E108
Abstract
An important goal of DNA microarray research is to develop tools to diagnose cancer more accurately based on the genetic profile of a tumor. There are several existing techniques in the literature for performing this type of diagnosis. Unfortunately, most of these techniques assume that different subtypes of cancer are already known to exist. Their utility is limited when such subtypes have not been previously identified. Although methods for identifying such subtypes exist, these methods do not work well for all datasets. It would be desirable to develop a procedure to find such subtypes that is applicable in a wide variety of circumstances. Even if no information is known about possible subtypes of a certain form of cancer, clinical information about the patients, such as their survival time, is often available. In this study, we develop some procedures that utilize both the gene expression data and the clinical data to identify subtypes of cancer and use this knowledge to diagnose future patients. These procedures were successfully applied to several publicly available datasets. We present diagnostic procedures that accurately predict the survival of future patients based on the gene expression profile and survival times of previous patients. This has the potential to be a powerful tool for diagnosing and treating cancer.
View details for PubMedID 15094809
-
Cancer characterization and feature set extraction by discriminative margin clustering
BMC BIOINFORMATICS
2004; 5
Abstract
A central challenge in the molecular diagnosis and treatment of cancer is to define a set of molecular features that, taken together, distinguish a given cancer, or type of cancer, from all normal cells and tissues. Discriminative margin clustering is a new technique for analyzing high dimensional quantitative datasets, especially applicable to gene expression data from microarray experiments related to cancer. The goal of the analysis is to find highly specialized sub-types of a tumor type which are similar in having a small combination of genes which together provide a unique molecular portrait for distinguishing the sub-type from any normal cell or tissue. Detection of the products of these genes can then, in principle, provide a basis for detection and diagnosis of a cancer, and a therapy directed specifically at the distinguishing constellation of molecular features can, in principle, provide a way to eliminate the cancer cells, while minimizing toxicity to any normal cell. The new methodology yields highly specialized tumor subtypes which are similar in terms of potential diagnostic markers.
View details for Web of Science ID 000220984700002
View details for PubMedID 15070405
-
Guidelines - Expression profiling - best practices for data generation and interpretation in clinical trials
NATURE REVIEWS GENETICS
2004; 5 (3): 229-237
View details for DOI 10.1038/nrg1297
View details for Web of Science ID 000189334500018
-
Gene expression profiling identifies clinically relevant subtypes of prostate cancer
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2004; 101 (3): 811-816
Abstract
Prostate cancer, a leading cause of cancer death, displays a broad range of clinical behavior from relatively indolent to aggressive metastatic disease. To explore potential molecular variation underlying this clinical heterogeneity, we profiled gene expression in 62 primary prostate tumors, as well as 41 normal prostate specimens and nine lymph node metastases, using cDNA microarrays containing approximately 26,000 genes. Unsupervised hierarchical clustering readily distinguished tumors from normal samples, and further identified three subclasses of prostate tumors based on distinct patterns of gene expression. High-grade and advanced stage tumors, as well as tumors associated with recurrence, were disproportionately represented among two of the three subtypes, one of which also included most lymph node metastases. To further characterize the clinical relevance of tumor subtypes, we evaluated as surrogate markers two genes differentially expressed among tumor subgroups by using immunohistochemistry on tissue microarrays representing an independent set of 225 prostate tumors. Positive staining for MUC1, a gene highly expressed in the subgroups with "aggressive" clinicopathological features, was associated with an elevated risk of recurrence (P = 0.003), whereas strong staining for AZGP1, a gene highly expressed in the other subgroup, was associated with a decreased risk of recurrence (P = 0.0008). In multivariate analysis, MUC1 and AZGP1 staining were strong predictors of tumor recurrence independent of tumor grade, stage, and preoperative prostate-specific antigen levels. Our results suggest that prostate tumors can be usefully classified according to their gene expression patterns, and these tumor subtypes may provide a basis for improved prognostication and treatment stratification.
View details for DOI 10.1073/pnas.0304146101
View details for PubMedID 14711987
-
Central carbon metabolism genes that predict disease-free survival in hormone receptor negative tumors.
27th Annual San Antonio Breast Cancer Symposium
SPRINGER. 2004: S115–S115
View details for Web of Science ID 000225589600326
-
1-norm support vector machines
17th Annual Conference on Neural Information Processing Systems (NIPS)
MIT PRESS. 2004: 49–56
View details for Web of Science ID 000225309500007
-
Boosted PRIM with application to searching for oncogenic pathway of lung cancer
IEEE Computational Systems Bioinformatics Conference (CSB 2004)
IEEE COMPUTER SOC. 2004: 604–609
View details for Web of Science ID 000224127800102
-
Gene expression patterns in ovarian carcinomas
MOLECULAR BIOLOGY OF THE CELL
2003; 14 (11): 4376-4386
Abstract
We used DNA microarrays to characterize the global gene expression patterns in surface epithelial cancers of the ovary. We identified groups of genes that distinguished the clear cell subtype from other ovarian carcinomas, grade I and II from grade III serous papillary carcinomas, and ovarian from breast carcinomas. Six clear cell carcinomas were distinguished from 36 other ovarian carcinomas (predominantly serous papillary) based on their gene expression patterns. The differences may yield insights into the worse prognosis and therapeutic resistance associated with clear cell carcinomas. A comparison of the gene expression patterns in the ovarian cancers to published data of gene expression in breast cancers revealed a large number of differentially expressed genes. We identified a group of 62 genes that correctly classified all 125 breast and ovarian cancer specimens. Among the best discriminators more highly expressed in the ovarian carcinomas were PAX8 (paired box gene 8), mesothelin, and ephrin-B1 (EFNB1). Although estrogen receptor was expressed in both the ovarian and breast cancers, genes that are coregulated with the estrogen receptor in breast cancers, including GATA-3, LIV-1, and X-box binding protein 1, did not show a similar pattern of coexpression in the ovarian cancers.
View details for PubMedID 12960427
-
Changes in gene expression in intermediate endpoints of gastric cancer: A randomized, placebo-controlled trial of Helicobacter pylori eradication therapy.
2nd Annual Conference on Frontiers in Cancer Prevention Research
AMER ASSOC CANCER RESEARCH. 2003: 1280S–1280S
View details for Web of Science ID 000187153300018
-
Characterization of variant patterns of nodular lymphocyte predominant Hodgkin lymphoma with immunohistologic and clinical correlation
AMERICAN JOURNAL OF SURGICAL PATHOLOGY
2003; 27 (10): 1346-1356
Abstract
Nodular lymphocyte predominant Hodgkin lymphoma (NLPHL) has traditionally been recognized as having two morphologic patterns, nodular and diffuse, and the current WHO definition of NLPHL requires at least a partial nodular pattern. Variant patterns have not been well documented. We analyzed retrospectively the morphologic and immunophenotypic patterns of NLPHL from 118 patients (total of 137 biopsy samples). Histology plus antibodies directed against CD20, CD3, and CD21 were used to evaluate the immunoarchitecture. We identified six distinct immunoarchitectural patterns in our cases of NLPHL: "classic" (B-cell-rich) nodular, serpiginous/interconnected nodular, nodular with prominent extranodular L&H cells, T-cell-rich nodular, diffuse with a T-cell-rich background (T-cell-rich B-cell lymphoma [TCRBCL]-like), and a (diffuse) B-cell-rich pattern. Small germinal centers within neoplastic nodules were found in approximately 15% of cases, a finding not previously emphasized in NLPHL. Prominent sclerosis was identified in approximately 20% of cases and was frequently seen in recurrent disease. Clinical follow-up was obtained on 56 patients, including 26 patients who had not had recurrence of disease and 30 patients who had recurrence. The follow-up period was 5 months to 16 years (median 2.5 years). The presence of a diffuse (TCRBCL-like) pattern was significantly more common in patients with recurrent disease than those without recurrence. Furthermore, the presence of a diffuse pattern (TCRBCL-like) was shown to be an independent predictor of recurrent disease (P = 0.00324). In addition, there is a tendency for progression to an increasingly more diffuse pattern over time. Analysis of sequential biopsies from patients with recurrent disease suggests that the presence of prominent extranodular L&H cells might represent early evolution to a diffuse (TCRBCL-like) pattern. We also report three patients who presented initially with diffuse large B-cell lymphoma and later developed NLPHL.
View details for Web of Science ID 000185584800007
View details for PubMedID 14508396
-
Statistical significance for genomewide studies
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2003; 100 (16): 9440-9445
Abstract
With the increase in genomewide experiments and the sequencing of multiple genomes, the analysis of large data sets has become commonplace in biology. It is often the case that thousands of features in a genomewide data set are tested against some null hypothesis, where a number of features are expected to be significant. Here we propose an approach to measuring statistical significance in these genomewide studies based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted. In doing so, a measure of statistical significance called the q value is associated with each tested feature. The q value is similar to the well known p value, except it is a measure of significance in terms of the false discovery rate rather than the false positive rate. Our approach avoids a flood of false positive results, while offering a more liberal criterion than what has been used in genome scans for linkage.
View details for DOI 10.1073/pnas.1530509100
View details for Web of Science ID 000184620000062
View details for PubMedID 12883005
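The q value described in the abstract above can be illustrated with a small sketch. This is not the authors' software: the function name `q_values` and the parameter `pi0` (the assumed proportion of truly null features) are illustrative, and `pi0` defaults to 1, a conservative choice under which the result coincides with step-up adjusted p values.

```python
def q_values(p_values, pi0=1.0):
    """Return one q value per p value.

    pi0 is an ASSUMED proportion of true null features (hypothetical
    parameter for this sketch); pi0 = 1 is the conservative default.
    """
    m = len(p_values)
    # Indices of the features ordered from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, so that each q value is the
    # smallest estimated FDR over all thresholds that still call the
    # feature significant.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        estimated_fdr = pi0 * m * p_values[i] / rank
        running_min = min(running_min, estimated_fdr)
        q[i] = running_min
    return q


if __name__ == "__main__":
    for p, q in zip([0.01, 0.04, 0.03, 0.5], q_values([0.01, 0.04, 0.03, 0.5])):
        print(f"p = {p:.2f}  q = {q:.4f}")
```

Calling every feature with q below some cutoff significant then keeps the estimated false discovery rate among the called features at or below that cutoff, under the stated assumption about `pi0`.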
-
Repeated observation of breast tumor subtypes in independent gene expression data sets
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2003; 100 (14): 8418-8423
Abstract
Characteristic patterns of gene expression measured by DNA microarrays have been used to classify tumors into clinically relevant subgroups. In this study, we have refined the previously defined subtypes of breast tumors that could be distinguished by their distinct patterns of gene expression. A total of 115 malignant breast tumors were analyzed by hierarchical clustering based on patterns of expression of 534 "intrinsic" genes and shown to subdivide into one basal-like, one ERBB2-overexpressing, two luminal-like, and one normal breast tissue-like subgroup. The genes used for classification were selected based on their similar expression levels between pairs of consecutive samples taken from the same tumor separated by 15 weeks of neoadjuvant treatment. Similar cluster analyses of two published, independent data sets representing different patient cohorts from different laboratories, uncovered some of the same breast cancer subtypes. In the one data set that included information on time to development of distant metastasis, subtypes were associated with significant differences in this clinical feature. By including a group of tumors from BRCA1 carriers in the analysis, we found that this genotype predisposes to the basal tumor subtype. Our results strongly support the idea that many of these breast tumor subtypes represent biologically distinct disease entities.
View details for DOI 10.1073/pnas.0932692100
View details for Web of Science ID 000184222500069
View details for PubMedID 12829800
View details for PubMedCentralID PMC166244
-
Note on "Comparison of model selection for regression" by Vladimir Cherkassky and Yunqian Ma
NEURAL COMPUTATION
2003; 15 (7): 1477-1480
Abstract
While Cherkassky and Ma (2003) raise some interesting issues in comparing techniques for model selection, their article appears to be written largely in protest of comparisons made in our book, Elements of Statistical Learning (2001). Cherkassky and Ma feel that we falsely represented the structural risk minimization (SRM) method, which they defend strongly here. In a two-page section of our book (pp. 212-213), we made an honest attempt to compare the SRM method with two related techniques, Akaike information criterion (AIC) and Bayesian information criterion (BIC). Apparently, we did not apply SRM in the optimal way. We are also accused of using contrived examples, designed to make SRM look bad. Alas, we did introduce some careless errors in our original simulation--errors that were corrected in the second and subsequent printings. Some of these errors were pointed out to us by Cherkassky and Ma (we supplied them with our source code), and as a result we replaced the assessment "SRM performs poorly overall" with a more moderate "the performance of SRM is mixed" (p. 212).
View details for Web of Science ID 000183421400002
View details for PubMedID 12816562
-
Class prediction by nearest shrunken centroids, with applications to DNA microarrays
STATISTICAL SCIENCE
2003; 18 (1): 104-117
View details for Web of Science ID 000184301600006
-
HGAL is a novel interleukin-4-inducible gene that strongly predicts survival in diffuse large B-cell lymphoma
BLOOD
2003; 101 (2): 433-440
Abstract
We have cloned and characterized a novel human gene, HGAL (human germinal center-associated lymphoma), which predicts outcome in patients with diffuse large B-cell lymphoma (DLBCL). The HGAL gene comprises 6 exons and encodes a cytoplasmic protein of 178 amino acids that contains an immunoreceptor tyrosine-based activation motif (ITAM). It is highly expressed in germinal center (GC) lymphocytes and GC-derived lymphomas and is homologous to the mouse GC-specific gene M17. Expression of the HGAL gene is specifically induced in B cells by interleukin-4 (IL-4). Patients with DLBCL expressing high levels of HGAL mRNA demonstrate significantly longer overall survival than do patients with low HGAL expression. This association was independent of the clinical international prognostic index. High HGAL mRNA expression should be used as a prognostic factor in DLBCL.
View details for Web of Science ID 000180384800010
View details for PubMedID 12509382
-
Statistical methods for identifying differentially expressed genes in DNA microarrays.
Methods in molecular biology (Clifton, N.J.)
2003; 224: 149-157
View details for PubMedID 12710672
-
Expression of cytokeratins 17 and 5 identifies a group of breast carcinomas with poor clinical outcome
AMERICAN JOURNAL OF PATHOLOGY
2002; 161 (6): 1991-1996
Abstract
While several prognostic factors have been identified in breast carcinoma, the clinical outcome remains hard to predict for individual patients. Better predictive markers are needed to help guide difficult treatment decisions. In a previous study of 78 breast carcinoma specimens, we noted an association between poor clinical outcome and the expression of cytokeratin 17 and/or cytokeratin 5 mRNAs. Here we describe the results of immunohistochemistry studies using monoclonal antibodies against these markers to analyze more than 600 paraffin-embedded breast tumors in tissue microarrays. We found that expression of cytokeratin 17 and/or cytokeratin 5/6 in tumor cells was associated with a poor clinical outcome. Moreover, multivariate analysis showed that in node-negative breast carcinoma, expression of these cytokeratins was a prognostic factor independent of tumor size and tumor grade.
View details for PubMedID 12466114
-
Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2002; 99 (20): 12963-12968
Abstract
Genomic DNA copy number alterations are key genetic events in the development and progression of human cancers. Here we report a genome-wide microarray comparative genomic hybridization (array CGH) analysis of DNA copy number variation in a series of primary human breast tumors. We have profiled DNA copy number alteration across 6,691 mapped human genes, in 44 predominantly advanced, primary breast tumors and 10 breast cancer cell lines. While the overall patterns of DNA amplification and deletion corroborate previous cytogenetic studies, the high-resolution (gene-by-gene) mapping of amplicon boundaries and the quantitative analysis of amplicon shape provide significant improvement in the localization of candidate oncogenes. Parallel microarray measurements of mRNA levels reveal the remarkable degree to which variation in gene copy number contributes to variation in gene expression in tumor cells. Specifically, we find that 62% of highly amplified genes show moderately or highly elevated expression, that DNA copy number influences gene expression across a wide range of DNA copy number alterations (deletion, low-, mid- and high-level amplification), that on average, a 2-fold change in DNA copy number is associated with a corresponding 1.5-fold change in mRNA levels, and that overall, at least 12% of all the variation in gene expression among the breast tumors is directly attributable to underlying variation in gene copy number. These findings provide evidence that widespread DNA copy number alteration can lead directly to global deregulation of gene expression, which may contribute to the development or progression of cancer.
View details for DOI 10.1073/pnas.162471999
View details for Web of Science ID 000178391700085
View details for PubMedID 12297621
View details for PubMedCentralID PMC130569
-
Empirical Bayes methods and false discovery rates for microarrays
GENETIC EPIDEMIOLOGY
2002; 23 (1): 70-86
Abstract
In a classic two-sample problem, one might use Wilcoxon's statistic to test for a difference between treatment and control subjects. The analogous microarray experiment yields thousands of Wilcoxon statistics, one for each gene on the array, and confronts the statistician with a difficult simultaneous inference situation. We will discuss two inferential approaches to this problem: an empirical Bayes method that requires very little a priori Bayesian modeling, and the frequentist method of "false discovery rates" proposed by Benjamini and Hochberg in 1995. It turns out that the two methods are closely related and can be used together to produce sensible simultaneous inferences.
View details for DOI 10.1002/gepi.01124
View details for Web of Science ID 000176697800006
View details for PubMedID 12112249
-
Diagnosis of multiple cancer types by shrunken centroids of gene expression
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2002; 99 (10): 6567-6572
Abstract
We have devised an approach to cancer class prediction from gene expression profiling, based on an enhancement of the simple nearest prototype (centroid) classifier. We shrink the prototypes and hence obtain a classifier that is often more accurate than competing methods. Our method of "nearest shrunken centroids" identifies subsets of genes that best characterize each class. The technique is general and can be used in many other classification problems. To demonstrate its effectiveness, we show that the method was highly efficient in finding genes for classifying small round blue cell tumors and leukemias.
View details for Web of Science ID 000175637300012
View details for PubMedID 12011421
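The idea of shrinking class prototypes, as described in the abstract above, can be sketched as follows. This is a simplified illustration and not the published procedure: it soft-thresholds each class centroid toward the overall centroid by a fixed amount `delta` and omits any per-gene standardization. The names `soft_threshold`, `shrunken_centroids`, `predict`, and `delta` are assumptions made for this sketch.

```python
def soft_threshold(x, delta):
    """Move x toward zero by delta, setting it to zero if |x| <= delta."""
    if x > delta:
        return x - delta
    if x < -delta:
        return x + delta
    return 0.0


def shrunken_centroids(X, y, delta):
    """Shrink each class centroid toward the overall centroid.

    X is a list of samples (each a list of gene values), y the class
    labels, delta a hypothetical shrinkage amount for this sketch.
    """
    n_genes = len(X[0])
    overall = [sum(row[j] for row in X) / len(X) for j in range(n_genes)]
    centroids = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [
            overall[j]
            + soft_threshold(sum(r[j] for r in rows) / len(rows) - overall[j], delta)
            for j in range(n_genes)
        ]
    return centroids


def predict(x, centroids):
    """Assign x to the class whose shrunken centroid is nearest."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


if __name__ == "__main__":
    X = [[0.0, 1.0], [0.2, 1.1], [4.0, 0.9], [4.2, 1.0]]
    y = ["a", "a", "b", "b"]
    cents = shrunken_centroids(X, y, delta=0.5)
    print(cents, predict([0.5, 5.0], cents))
```

A gene whose class centroids are all shrunk onto the overall centroid contributes the same amount to every class distance, so it no longer affects the classification; in this sketch that is how a subset of characterizing genes emerges.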
-
Precision and functional specificity in mRNA decay
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2002; 99 (9): 5860-5865
Abstract
Posttranscriptional processing of mRNA is an integral component of the gene expression program. By using DNA microarrays, we precisely measured the decay of each yeast mRNA, after thermal inactivation of a temperature-sensitive RNA polymerase II. The half-lives varied widely, ranging from approximately 3 min to more than 90 min. We found no simple correlation between mRNA half-lives and ORF size, codon bias, ribosome density, or abundance. However, the decay rates of mRNAs encoding groups of proteins that act together in stoichiometric complexes were generally closely matched, and other evidence pointed to a more general relationship between physiological function and mRNA turnover rates. The results provide strong evidence that precise control of the decay of each mRNA is a fundamental feature of the gene expression program in yeast.
View details for DOI 10.1073/pnas.092538799
View details for Web of Science ID 000175377800023
View details for PubMedID 11972065
View details for PubMedCentralID PMC122867
-
Transcriptional programs activated by exposure of human prostate cancer cells to androgen
GENOME BIOLOGY
2002; 3 (7)
Abstract
Androgens are required for both normal prostate development and prostate carcinogenesis. We used DNA microarrays, representing approximately 18,000 genes, to examine the temporal program of gene expression following treatment of the human prostate cancer cell line LNCaP with a synthetic androgen. We observed statistically significant changes in levels of transcripts of more than 500 genes. Many of these genes were previously reported androgen targets, but most were not previously known to be regulated by androgens. The androgen-induced expression programs in three additional androgen-responsive human prostate cancer cell lines, and in four androgen-independent subclones derived from LNCaP, shared many features with those observed in LNCaP, but some differences were observed. A remarkable fraction of the genes induced by androgen appeared to be related to production of seminal fluid and these genes included many with roles in protein folding, trafficking, and secretion. Prostate cancer cell lines retain features of androgen responsiveness that reflect normal prostatic physiology. These results provide a broad view of the effect of androgen signaling on the transcriptional program in these cancer cells, and a foundation for further studies of androgen action.
View details for Web of Science ID 000207581200008
View details for PubMedID 12184806
-
Pre-validation and inference in microarrays.
Statistical applications in genetics and molecular biology
2002; 1: Article1-?
Abstract
In microarray studies, an important problem is to compare a predictor of disease outcome derived from gene expression levels to standard clinical predictors. Comparing them on the same dataset that was used to derive the microarray predictor can lead to results strongly biased in favor of the microarray predictor. We propose a new technique called "pre-validation" for making a fairer comparison between the two sets of predictors. We study the method analytically and explore its application in a recent study on breast cancer.
View details for PubMedID 16646777
-
Supervised learning from microarray data
15th Biannual Conference on Computational Statistics (COMPSTAT)
PHYSICA-VERLAG GMBH & CO. 2002: 67–77
View details for Web of Science ID 000179942900007
-
Exploratory screening of genes and clusters from microarray experiments
STATISTICA SINICA
2002; 12 (1): 47-59
View details for Web of Science ID 000174372800004
-
Empirical Bayes analysis of a microarray experiment
160th Annual Meeting of the American-Statistical-Association
AMER STATISTICAL ASSOC. 2001: 1151–60
View details for Web of Science ID 000172728000002
-
Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2001; 98 (19): 10869-10874
Abstract
The purpose of this study was to classify breast carcinomas based on variations in gene expression patterns derived from cDNA microarrays and to correlate tumor characteristics to clinical outcome. A total of 85 cDNA microarray experiments representing 78 cancers, three fibroadenomas, and four normal breast tissues were analyzed by hierarchical clustering. As reported previously, the cancers could be classified into a basal epithelial-like group, an ERBB2-overexpressing group and a normal breast-like group based on variations in gene expression. A novel finding was that the previously characterized luminal epithelial/estrogen receptor-positive group could be divided into at least two subgroups, each with a distinctive expression profile. These subtypes proved to be reasonably robust by clustering using two different gene sets: first, a set of 456 cDNA clones previously selected to reflect intrinsic properties of the tumors and, second, a gene set that highly correlated with patient outcome. Survival analyses on a subcohort of patients with locally advanced breast cancer uniformly treated in a prospective study showed significantly different outcomes for the patients belonging to the various groups, including a poor prognosis for the basal-like subtype and a significant difference in outcome for the two estrogen receptor-positive groups.
View details for Web of Science ID 000170966800067
View details for PubMedID 11553815
View details for PubMedCentralID PMC58566
-
Expression of a single gene, BCL-6, strongly predicts survival in patients with diffuse large B-cell lymphoma
BLOOD
2001; 98 (4): 945-951
Abstract
Diffuse large B-cell lymphoma (DLBCL) is characterized by a marked degree of morphologic and clinical heterogeneity. Establishment of parameters that can predict outcome could help to identify patients who may benefit from risk-adjusted therapies. BCL-6 is a proto-oncogene commonly implicated in DLBCL pathogenesis. A real-time reverse transcription-polymerase chain reaction assay was established for accurate and reproducible determination of BCL-6 mRNA expression. The method was applied to evaluate the prognostic significance of BCL-6 expression in DLBCL. BCL-6 mRNA expression was assessed in tumor specimens obtained at the time of diagnosis from 22 patients with primary DLBCL. All patients were subsequently treated with anthracycline-based chemotherapy regimens. These patients could be divided into 2 DLBCL subgroups, one with high BCL-6 gene expression whose median overall survival (OS) time was 171 months and the other with low BCL-6 gene expression whose median OS was 24 months (P =.007). BCL-6 gene expression also predicted OS in an independent validation set of 39 patients with primary DLBCL (P =.01). BCL-6 protein expression, assessed by immunohistochemistry, also predicted longer OS in patients with DLBCL. BCL-6 gene expression was an independent survival predicting factor in multivariate analysis together with the elements of the International Prognostic Index (IPI) (P =.038). By contrast, the aggregate IPI score did not add further prognostic information to the patients' stratification by BCL-6 gene expression. High BCL-6 mRNA expression should be considered a new favorable prognostic factor in DLBCL and should be used in the stratification and the design of risk-adjusted therapies for patients with DLBCL. (Blood. 2001;98:945-951)
View details for Web of Science ID 000170364100008
View details for PubMedID 11493437
-
Missing value estimation methods for DNA microarrays
BIOINFORMATICS
2001; 17 (6): 520-525
Abstract
Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1-20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.
View details for Web of Science ID 000169404700005
View details for PubMedID 11395428
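A weighted K-nearest-neighbors imputation of the kind named in the abstract above might look like the following sketch. It is not the KNNimpute tool itself: the function name `knn_impute`, the use of root-mean-square distance over the columns two genes share, and the inverse-distance weights are all assumptions chosen for illustration.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries in a genes-by-arrays matrix.

    Each missing entry is replaced by a weighted average of the same
    column in the k most similar rows (genes). Distance and weighting
    choices here are assumptions of this sketch.
    """
    filled = [row[:] for row in matrix]  # leave the input untouched
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(matrix):
                # A neighbor must be a different gene observed in column j.
                if other_i == i or other[j] is None:
                    continue
                shared = [
                    (a, b)
                    for a, b in zip(row, other)
                    if a is not None and b is not None
                ]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # nothing to borrow from; keep the gap
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            # Closer genes get larger weights; the small constant
            # avoids division by zero for identical profiles.
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            filled[i][j] = sum(
                w * v for w, (_, v) in zip(weights, nearest)
            ) / sum(weights)
    return filled


if __name__ == "__main__":
    data = [[1.0, 2.0, None], [1.0, 2.0, 3.0], [5.0, 6.0, 9.0]]
    print(knn_impute(data, k=1))
```

Distances are computed from the original matrix rather than from already-imputed values, so the order in which gaps are filled does not change the result.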
-
Significance analysis of microarrays applied to the ionizing radiation response
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2001; 98 (9): 5116-5121
Abstract
Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. We describe a method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared with FDRs of 60 and 84% by using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation and 3 in apoptosis. Surprisingly, four nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.
View details for Web of Science ID 000168311500058
View details for PubMedID 11309499
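The two steps in the abstract above, a per-gene score relative to the standard deviation and a permutation estimate of the false discovery rate, can be sketched as below. This is a simplified illustration rather than the SAM software: the function names, the two-group design coded as labels 0 and 1, the small constant `s0` added to the denominator, and the fixed score threshold are assumptions of this sketch.

```python
import math
import random


def relative_difference(values, labels, s0=0.1):
    """Score one gene: difference in group means over (pooled s.d. + s0).

    s0 is a hypothetical small positive constant for this sketch that
    keeps the score finite when the variance is near zero.
    """
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
    pooled = math.sqrt(ss / (len(a) + len(b) - 2) * (1 / len(a) + 1 / len(b)))
    return (mean_b - mean_a) / (pooled + s0)


def permutation_fdr(data, labels, threshold, n_perm=200, seed=0):
    """Return (genes called, estimated FDR) for |score| > threshold.

    The FDR estimate is the average number of genes exceeding the
    threshold under shuffled labels, divided by the number called
    with the real labels.
    """
    rng = random.Random(seed)
    called = sum(abs(relative_difference(g, labels)) > threshold for g in data)
    if called == 0:
        return 0, 0.0
    shuffled = list(labels)
    false_calls = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        false_calls += sum(
            abs(relative_difference(g, shuffled)) > threshold for g in data
        )
    return called, min(1.0, false_calls / n_perm / called)


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 1]
    data = [
        [1.0, 1.1, 0.9, 5.0, 5.1, 4.9],  # clearly changed
        [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],  # unchanged
    ]
    print(permutation_fdr(data, labels, threshold=5.0))
```

Raising the threshold calls fewer genes and usually lowers the estimated FDR, which is the trade-off the adjustable threshold in the abstract refers to.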
-
Supervised harvesting of expression trees
GENOME BIOLOGY
2001; 2 (1)
Abstract
We propose a new method for supervised learning from gene expression data. We call it 'tree harvesting'. This technique starts with a hierarchical clustering of genes, then models the outcome variable as a sum of the average expression profiles of chosen clusters and their products. It can be applied to many different kinds of outcome measures such as censored survival times, or a response falling in two or more classes (for example, cancer classes). The method can discover genes that have strong effects on their own, and genes that interact with other genes. We illustrate the method on data from a lymphoma study, and on a dataset containing samples from eight different cancers. It identified some potentially interesting gene clusters. In simulation studies we found that the procedure may require a large number of experimental samples to successfully discover interactions. Tree harvesting is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.
View details for Web of Science ID 000207583500011
View details for PubMedID 11178280
-
Estimating the number of clusters in a data set via the gap statistic
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2001; 63: 411-423
View details for Web of Science ID 000168837200013
-
The inference of antigen selection on Ig genes
JOURNAL OF IMMUNOLOGY
2000; 165 (9): 5122-5126
Abstract
Analysis of somatic mutations in V regions of Ig genes is important for understanding various biological processes. It is customary to estimate Ag selection on Ig genes by assessment of replacement (R) as opposed to silent (S) mutations in the complementarity-determining regions and S as opposed to R mutations in the framework regions. In the past such an evaluation was performed using a binomial distribution model equation, which is inappropriate for Ig genes in which mutations have four different distribution possibilities (R and S mutations in the complementarity-determining region and/or framework regions of the gene). In the present work, we propose a multinomial distribution model for assessment of Ag selection. Side-by-side application of multinomial and binomial models on 86 previously established Ig sequences disclosed 8 discrepancies, leading to opposite statistical conclusions about Ag selection. We suggest the use of the multinomial model for all future analysis of Ag selection.
View details for Web of Science ID 000090076000047
View details for PubMedID 11046043
-
Bayesian backfitting - Comments and rejoinder
STATISTICAL SCIENCE
2000; 15 (3): 213-223
View details for Web of Science ID 000166404100003
-
Bayesian backfitting
STATISTICAL SCIENCE
2000; 15 (3): 196-213
View details for Web of Science ID 000166404100002
-
Additive logistic regression: A statistical view of boosting
ANNALS OF STATISTICS
2000; 28 (2): 337-374
View details for Web of Science ID 000089669700001
-
Molecular analysis of immunoglobulin genes in diffuse large B-cell lymphomas
BLOOD
2000; 95 (5): 1797-1803
Abstract
Diffuse large B-cell lymphoma (DLBCL) is a common type of non-Hodgkin's lymphoma (NHL) that is highly heterogeneous from both clinical and histopathologic viewpoints. The immunoglobulin (Ig) heavy (H) chain variable region genes were examined in 71 patients with untreated primary DLBCL. Fifty-eight potentially functional V(H) genes were detected in 53 DLBCL cases; V(H) genes were nonfunctional in 9 cases and were not detected in an additional 9 cases. The use of V(H) gene families by DLBCL tumors was unbiased without overrepresentation of any particular V(H) gene or gene family. Analysis of Ig mutations in comparison to the most closely related germline gene disclosed mutated V(H) genes in all but 1 DLBCL case. More than 2% difference from the most similar germline sequence was detected in 52 potentially functional and the 8 nonfunctional V(H) gene sequences, whereas less than 2% difference from the germline sequence was observed in 3 V(H) gene isolates. Only 3 V(H) gene isolates were unmutated. No correlation was found between V(H) gene use, mutation level, and International Prognostic Index (IPI) or survival. Six of 8 tested tumors showed evidence of ongoing somatic mutations. Evidence for positive or negative antigen selection pressure was observed in 65% of mutated DLBCL cases. Our findings indicate that the etiology and the driving forces for clonal expansion are heterogeneous, which may explain the well-known clinical and pathologic heterogeneity of DLBCL. (Blood. 2000;95:1797-1803)
View details for Web of Science ID 000085564700037
View details for PubMedID 10688840
-
Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling
NATURE
2000; 403 (6769): 503-511
Abstract
Diffuse large B-cell lymphoma (DLBCL), the most common subtype of non-Hodgkin's lymphoma, is clinically heterogeneous: 40% of patients respond well to current therapy and have prolonged survival, whereas the remainder succumb to the disease. We proposed that this variability in natural history reflects unrecognized molecular heterogeneity in the tumours. Using DNA microarrays, we have conducted a systematic characterization of gene expression in B-cell malignancies. Here we show that there is diversity in gene expression among the tumours of DLBCL patients, apparently reflecting the variation in tumour proliferation rate, host response and differentiation state of the tumour. We identified two molecularly distinct forms of DLBCL which had gene expression patterns indicative of different stages of B-cell differentiation. One type expressed genes characteristic of germinal centre B cells ('germinal centre B-like DLBCL'); the second type expressed genes normally induced during in vitro activation of peripheral blood B cells ('activated B-like DLBCL'). Patients with germinal centre B-like DLBCL had a significantly better overall survival than those with activated B-like DLBCL. The molecular classification of tumours on the basis of gene expression can thus identify previously undetected and clinically significant subtypes of cancer.
View details for Web of Science ID 000085227300039
View details for PubMedID 10676951
-
'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns.
Genome biology
2000; 1 (2): RESEARCH0003
Abstract
Large gene expression studies, such as those conducted using DNA arrays, often provide millions of different pieces of data. To address the problem of analyzing such data, we describe a statistical method, which we have called 'gene shaving'. The method identifies subsets of genes with coherent expression patterns and large variation across conditions. Gene shaving differs from hierarchical clustering and other widely used methods for analyzing gene expression studies in that genes may belong to more than one cluster, and the clustering may be supervised by an outcome measure. The technique can be 'unsupervised', that is, the genes and samples are treated as unlabeled, or partially or fully supervised by using known properties of the genes or samples to assist in finding meaningful groupings. We illustrate the use of the gene shaving method to analyze gene expression measurements made on samples from patients with diffuse large B-cell lymphoma. The method identifies a small cluster of genes whose expression is highly predictive of survival. The gene shaving method is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worth further investigation.
View details for PubMedID 11178228
-
Model search by bootstrap "bumping"
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
1999; 8 (4): 671-686
View details for Web of Science ID 000084566000001
-
Statistical measures for the computer-aided diagnosis of mammographic masses
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
1999; 8 (3): 531-543
View details for Web of Science ID 000083134100011
-
The covariance inflation criterion for adaptive model selection
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
1999; 61: 529-546
View details for Web of Science ID 000080641500002
-
The problem of regions
ANNALS OF STATISTICS
1998; 26 (5): 1687-1718
View details for Web of Science ID 000079135700002
-
Classification by pairwise coupling
ANNALS OF STATISTICS
1998; 26 (2): 451-471
View details for Web of Science ID 000079135400001
-
Classification by pairwise coupling
11th Annual Conference on Neural Information Processing Systems (NIPS)
MIT PRESS. 1998: 507–513
View details for Web of Science ID 000075130700072
-
Improvements on cross-validation: The .632+ bootstrap method
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
1997; 92 (438): 548-560
View details for Web of Science ID A1997XE29600020
-
The lasso method for variable selection in the Cox model
STATISTICS IN MEDICINE
1997; 16 (4): 385-395
Abstract
I propose a new method for variable selection and shrinkage in Cox's proportional hazards model. My proposal minimizes the log partial likelihood subject to the sum of the absolute values of the parameters being bounded by a constant. Because of the nature of this constraint, it shrinks coefficients and produces some coefficients that are exactly zero. As a result it reduces the estimation variance while providing an interpretable final model. The method is a variation of the 'lasso' proposal of Tibshirani, designed for the linear regression context. Simulations indicate that the lasso can be more accurate than stepwise selection in this setting.
View details for Web of Science ID A1997WK01900006
View details for PubMedID 9044528
- Association between cellular phones and car collisions. New Engl. J. Med. 1997
-
Using specially designed exponential families for density estimation
ANNALS OF STATISTICS
1996; 24 (6): 2431-2461
View details for Web of Science ID A1996WK45900006
-
Discriminant adaptive nearest neighbor classification and regression
9th Annual Conference on Neural Information Processing Systems (NIPS)
MIT PRESS. 1996: 409–415
View details for Web of Science ID A1996BG45M00058
-
Generalized additive models for medical research.
Statistical methods in medical research
1995; 4 (3): 187-196
Abstract
This article reviews flexible statistical methods that are useful for characterizing the effect of potential prognostic factors on disease endpoints. Applications to survival models and binary outcome models are illustrated.
View details for PubMedID 8548102
- Flexible discriminant analysis. J. Amer. Statist. Assoc. 1994
- An Introduction to the Bootstrap Chapman and Hall, New York and London. 1993
- Generalized additive models, Chapman and Hall, London 1990