Karthik Devarajan, PhD
Office Phone: 215-728-2794
1. Molecular Pattern Discovery and Dimensionality Reduction Using NMF: A Unified Framework
In collaboration with Ebrahimi
We address the problem of molecular pattern discovery and dimensionality reduction for large-scale biological data using nonnegative matrix factorization (NMF). Molecular pattern discovery is concerned with molecular pattern recognition via unsupervised clustering, and with the identification of clusters of samples or genes revealed by the gene expression profiles. Analysis of genome-wide expression patterns provides unique insights into the structure of genetic networks and into biological processes not yet understood at the molecular level. Class discovery is concerned with the identification of hidden features in gene expression profiles that reflect molecular signatures of the tissue from which the cells originated. On the other hand, one could also be interested in identifying clusters of genes revealed by their expression profiles across samples.
NMF with multiplicative update rules is a powerful method for decomposing the gene expression matrix V into two matrices with nonnegative entries, V ≈ WH, where each column of W defines a metagene and each column of H represents the metagene expression pattern of the corresponding sample (Lee and Seung, Nature 401:759, 1999). We proposed an unsupervised method for molecular pattern discovery using NMF, based on Rényi's divergence between two nonnegative matrices. We demonstrated the applicability of our approach using cancer microarray data for elucidating cancer sub-types and identifying genetic networks. We have also successfully applied this method as an exploratory tool for visualization as well as for reducing the dimensionality of large-scale biological data sets. Our methods have been implemented in Matlab (Mathworks, Natick, MA) and in the statistical language R (R Development Core Team (2006), Vienna, Austria. ISBN 3-900051-07-0, R Project).
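As a rough illustration of the factorization V ≈ WH, the following dependency-free Python sketch applies the multiplicative updates of Lee and Seung for the generalized Kullback-Leibler divergence. It is a stand-in chosen for brevity, not the Rényi-divergence algorithm described above. The function names, the toy matrix, the iteration count and the small `eps` guard are all assumptions made for this example.

```python
import math
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    B_cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in B_cols] for row in A]


def nmf_kl(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factor nonnegative V (genes x samples) as W @ H with k metagenes.

    Lee-Seung multiplicative updates for the generalized KL divergence;
    an illustrative stand-in, not the Renyi-divergence updates.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]  # columns: metagenes
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]  # columns: samples

    def ratio(W, H):
        # Element-wise V / (WH), guarded against division by zero.
        WH = matmul(W, H)
        return [[V[i][j] / (WH[i][j] + eps) for j in range(m)] for i in range(n)]

    for _ in range(n_iter):
        R = ratio(W, H)
        w_col_sums = [sum(W[i][a] for i in range(n)) + eps for a in range(k)]
        H = [[H[a][j] * sum(W[i][a] * R[i][j] for i in range(n)) / w_col_sums[a]
              for j in range(m)] for a in range(k)]
        R = ratio(W, H)
        h_row_sums = [sum(H[a]) + eps for a in range(k)]
        W = [[W[i][a] * sum(H[a][j] * R[i][j] for j in range(m)) / h_row_sums[a]
              for a in range(k)] for i in range(n)]
    return W, H


def kl_divergence(V, WH, eps=1e-9):
    """Generalized KL divergence D(V || WH)."""
    total = 0.0
    for v_row, wh_row in zip(V, WH):
        for v, wh in zip(v_row, wh_row):
            if v > 0:
                total += v * math.log(v / (wh + eps))
            total += wh - v
    return total


# Toy matrix (assumed data): 4 genes x 4 samples with two sample groups.
V = [[5.0, 5.0, 0.0, 0.0],
     [4.0, 4.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 3.0],
     [0.0, 0.0, 6.0, 6.0]]
W, H = nmf_kl(V, k=2)
# Assign each sample to the metagene with the largest coefficient in its column of H.
labels = [max(range(2), key=lambda a: H[a][j]) for j in range(len(V[0]))]
print(labels, round(kl_divergence(V, matmul(W, H)), 4))
```

Because the starting values are random, a different `seed` may give a different local solution and may number the metagenes in a different order.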
2. Parallel Implementation of NMF Algorithms Using a High-Performance Computing (HPC) Cluster
In collaboration with Wang
The NMF update rules derived in our original work (Devarajan K et al, 2005) guarantee convergence of the NMF algorithm to a local minimum from random initial values for the nonnegative matrices W and H. However, the algorithm may not converge to the same solution on each run because of the stochastic nature of the initial conditions. This feature can be exploited to evaluate the consistency of the algorithm's performance by applying it multiple times (typically 50–200) with random starting values for W and H. For any given value of the parameter α in Rényi's divergence (see above), the algorithm groups the samples or genes into κ clusters, where κ is the pre-specified rank of the factorization. To assess whether a given rank κ provides a biologically meaningful decomposition of the data, we apply a model selection procedure that utilizes consensus clustering and provides a quantitative evaluation of the robustness of the factorization (Devarajan K et al, 2007; Brunet et al, Proc Natl Acad Sci. 101:4164, 2004). This enables us to evaluate various choices of κ and α, each based on multiple runs, and to choose the appropriate model for a given data set.
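The consensus step can be sketched as follows, in the spirit of Brunet et al. Given the cluster labels from several runs, the consensus matrix records the fraction of runs in which each pair of samples was clustered together, and a cophenetic correlation coefficient summarizes how stable that grouping is. The function names, the toy label vectors and the choice of average-linkage clustering are assumptions of this sketch rather than details taken from our implementation.

```python
import math


def consensus_matrix(runs):
    """Entry (i, j): fraction of runs that put samples i and j in one cluster."""
    n = len(runs[0])
    C = [[0.0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    C[i][j] += 1.0
    return [[c / len(runs) for c in row] for row in C]


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def cophenetic_correlation(C):
    """Correlate distances 1 - C with average-linkage merge heights."""
    n = len(C)
    D = [[1.0 - C[i][j] for j in range(n)] for i in range(n)]
    clusters = {i: [i] for i in range(n)}      # cluster id -> member samples
    coph = [[0.0] * n for _ in range(n)]       # height at which i and j first merge
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x, p in enumerate(ids):
            for q in ids[x + 1:]:
                a, b = clusters[p], clusters[q]
                d = sum(D[i][j] for i in a for j in b) / (len(a) * len(b))
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        for i in clusters[p]:
            for j in clusters[q]:
                coph[i][j] = coph[j][i] = d
        clusters[p] = clusters[p] + clusters.pop(q)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return pearson([D[i][j] for i, j in pairs], [coph[i][j] for i, j in pairs])


# Toy labels from three hypothetical runs on four samples.
runs = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
C = consensus_matrix(runs)
print(cophenetic_correlation(C))
```

A coefficient near 1 indicates that the runs agree on the grouping. Repeating the calculation for each candidate κ and α gives one stability score per model to compare.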
The implementation of these steps in the model selection procedure for any real microarray dataset is computationally very intensive. However, each step in the model selection procedure can be run independently and simultaneously, so the NMF algorithm lends itself easily to a parallel implementation that greatly increases speed and efficiency. To apply this method efficiently to large-scale biological data, we implemented the algorithm on the HPC cluster using the Message Passing Interface (MPI)/C++ platform. We have also created an integrated package that consists of the following steps: data input, initialization and model parameterization, model selection, graphical display and output of results via a graphical user interface that communicates between a Windows desktop and the HPC cluster. The interface and the connection between the desktop and the HPC cluster were built using C#, and the rest of the package was implemented in C++ on top of MPI.
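The independence of the runs can be illustrated with a small Python sketch that enumerates every (κ, α, run) combination as a separate task and hands the tasks to a worker pool. A thread pool from the standard library stands in here for MPI ranks, and `fit_once` is a hypothetical placeholder that returns random labels instead of performing an NMF run. None of these names come from our C++ package.

```python
import itertools
import random
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def fit_once(task):
    """Hypothetical stand-in for one NMF run; returns random cluster labels.

    A real worker would factor the data with rank k and parameter alpha from
    a random start determined by `seed`, then return the cluster assignments.
    """
    k, alpha, seed = task
    rng = random.Random(seed)
    n_samples = 8  # assumed toy size
    return task, [rng.randrange(k) for _ in range(n_samples)]


def run_grid(ranks, alphas, n_runs, max_workers=4):
    """Run every (k, alpha, run) task independently and group results by model."""
    tasks = list(itertools.product(ranks, alphas, range(n_runs)))
    results = defaultdict(list)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # No task depends on another, so they can be executed in any order;
        # pool.map returns the results in task order regardless of scheduling.
        for (k, alpha, _seed), labels in pool.map(fit_once, tasks):
            results[(k, alpha)].append(labels)
    return dict(results)


grid = run_grid(ranks=[2, 3], alphas=[0.5, 1.0], n_runs=5)
print({model: len(label_sets) for model, label_sets in grid.items()})
```

Each task carries its own seed, so the grouped results do not depend on how many workers are used. Under MPI the same task list could, for example, be divided among ranks by task index.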
3. Evaluation of In Vivo and In Vitro Pharmacology and Toxicology of Preventive Agents Using Human Mutant Cells from Dominantly Heritable Cancers
In collaboration with Alfred Knudson, Jr, Alfonso Bellacosa, Margie Clapper & Caretti
The purpose of this study is to identify potential molecular targets of cancer chemopreventive agents. The greatest opportunity to identify such early biomarkers is provided by dominantly inherited cancer syndromes whose responsible germinally mutant genes have been characterized. We examined the mRNA expression profiles of primary cultures from selected tissues (kidney epithelial cells and colon fibroblasts) obtained from individuals with six representative heritable cancer syndromes using Affymetrix technology. This was done in the presence and absence of a panel of putative chemopreventive agents. We pre-processed gene expression data using the Robust Multichip Average method (Irizarry et al, Biostatistics 4:249, 2003). We compared expression profiles between genotypes and treatments using the local pooled error method, analysis of variance and rank-based tests. Statistical significance was measured in terms of the false discovery rate (FDR) given by the q-value, while biological significance was measured in terms of the ratio of mean expressions between the classes being compared. The q-value for a given gene is the minimum FDR incurred when calling that gene significant. A volcano plot of q-values versus fold-change on the logarithmic scale enabled us to visualize the relationship between statistical and biological significance. We identified genes differentially expressed between treatments within genotypes and between genotypes within treatments, as well as genes for which the effect of treatment differed between genotypes. We applied unsupervised clustering methods such as NMF and hierarchical clustering to identify potential sub-groups of interest within the genes, and are currently mining these gene lists further using pathway analysis. We are also in the process of validating these biomarkers via quantitative PCR.
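To make the q-value and volcano-plot quantities concrete, here is a minimal Python sketch. It uses the Benjamini-Hochberg adjustment, which corresponds to a q-value estimate under the conservative assumption that all genes are null, and it may differ from the estimator used in our analysis. The function names and the example numbers are hypothetical, and the group means are assumed to be positive values on the original (not log) scale.

```python
import math


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, one per gene, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q never increases as p decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


def volcano_points(mean_a, mean_b, qvalues):
    """Per gene: (log2 fold change of class A over class B, -log10 q-value)."""
    return [(math.log2(a / b), -math.log10(max(q, 1e-300)))
            for a, b, q in zip(mean_a, mean_b, qvalues)]


# Hypothetical p-values and class means for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
qvals = bh_qvalues(pvals)
points = volcano_points([8.0, 3.0, 2.0, 1.0], [2.0, 3.0, 4.0, 4.0], qvals)
for gene, (lfc, score) in enumerate(points):
    print(gene, round(lfc, 2), round(score, 2))
```

Genes far from zero on the horizontal axis and high on the vertical axis are those that are both biologically and statistically notable.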
4. Gene Expression Analysis of Time-Course Microarray Data
In collaboration with David Wiest & Jennifer Rhodes
Immature thymocytes were induced to differentiate in a synchronized wave, and samples were then taken at time points (3, 6, 12, and 24 hours) that represented incremental advances from the undifferentiated (0 hour) to the fully differentiated (24 hours) state. Genes whose expression was modulated during differentiation were identified by performing expression profiling across the four time points listed. The list of differentially expressed genes was then interrogated using NMF and relevant biological correlations were extracted. The stochastic nature of the NMF algorithm provided an approach for evaluating the stability of the clustering and a quantitative evaluation of the performance of the method. We adopted a generalized approach to molecular pattern discovery using NMF, as outlined in Devarajan and Ebrahimi (Devarajan K et al, 2005; Devarajan K et al, 2007). We investigated different models based on the divergence measure used in the factorization (see above). Overall, NMF analysis identified six distinct clusters (1–6) of genes with high confidence. These genes showed varying expression profiles across the four time points. Cluster 6 was further sub-divided into two clusters (6a and 6b). Among the genes identified are many known to be functionally important in regulating thymocyte development. All of these genes exhibited significant changes in expression by the 6-hour time point and were enriched in clusters 3 and 6b. Based on this, we identified 30 candidate genes whose importance in thymocyte development and transformation will be assessed through functional screens. In addition, based on this analysis we selected the 6-hour time point, at which most genes known to regulate early thymocyte development are differentially expressed, as a source of mRNA from which to produce a cDNA library that will be employed in our forward genetic screen.