| In Defense of l0 (2009) | |||||||||||||||||
Abstract | |||||||||||||||||
| In the past decade, there has been an explosion of interest in using l1-regularization in replace of l0regularization for feature selection. We present results showing that while l1-regularization never outperforms l0-regularization by more than a constant factor, in some cases using an l1 penalty is infinitely worse than an l0 penalty. We also compare algorithms solving these two problems from several aspects and show that, although good solutions have been developed for l1 problem, they may not perform as well as the very classic stepwise regression, which is a greedy l0 surrogate. In other words, “an approximate solution to the right problem ” can be better than “the exact solutions to the wrong problem”. We focus on variable selection problems in which there is a large set of potential features, only a few of which are likely to be helpful. This type of sparsity occurs often in various machine learning tasks, such as predicting disease based on millions of genes, or predicting the topic of a document based on the occurances of hundreds of thousands of words. Consider a normal linear model y = Xβ + ε, where y is the response variable with n observation, X = (x1,..., xp) is an n × p design matrix, β = (β1,..., βp) ′ is the coefficient parameters, and error ε ∼ N(0, σ2In). Assume that only a subset of {xj} p j=1 has nonzero coefficients. The traditional statistical approach to this problem, namely, the l0 problem, finds an estimator that mini-Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute. mizes the l0 penalized sum of squared errors | |||||||||||||||||
Publication details | |||||||||||||||||
| |||||||||||||||||