Distance-based Model-Selection with application to the Analysis of Gene Expression Data (2003)

Abstract
Multivariate mixture models provide a convenient method of density estimation and model-based clustering, and they can suggest possible explanations for the actual data-generation process. However, choosing the number of components ($g$) in a statistically meaningful way is still a subject of considerable research. Available methods for estimating $g$ include optimizing AIC and BIC, estimating the number through nonparametric maximum likelihood, hypothesis testing, and Bayesian approaches with entropy distances. In our current research we present several rules for selecting a finite mixture model, and hence $g$, based on estimation and inference using a quadratic distance measure. In one methodology, the goal is to find the minimal number of components needed to adequately describe the true distribution, based on a nonparametric confidence set for the true distribution. We also present results for selecting $g$ based on a risk analysis that includes a penalty for overfitting. Another, less formal methodology is based on the concordance measure, which is analogous to $R^2$ in regression. Moreover, we develop diagnostics for outlier detection. These diagnostics help to distinguish between outliers and true clusters, and they provide insight into the initial values for iterative estimation of additional components. In this dissertation we also develop tools for determining the number of modes in a mixture of multivariate normal densities. We use these criteria to select clusters that display distinct modes. Finally, we fine-tune our methods to analyze gene-expression data from microarrays and compare them with other competing methods.

Publication details
Download http://etda.libraries.psu.edu/theses/approved/WorldWideIndex/ETD-375/index.html
Source http://etda.libraries.psu.edu/theses/approved/WorldWideIndex/ETD-375/index.html
Publisher Penn State
Contributors Bruce G. Lindsay, Thomas P. Hettmansperger, Francesca Chiaromonte, Benjamin F. Pugh
Repository Penn State Electronic Thesis and Dissertation Collection (United States)
Keywords Statistics
Type text
Language English