<node id="65151">
  <nid>65151</nid>
  <type>event</type>
  <uid>
    <user id="27187"><![CDATA[27187]]></user>
  </uid>
  <created>1300976838</created>
  <changed>1475891678</changed>
  <title><![CDATA[UPS Delivers Optimal Phase Diagram for High Dimensional Variable Selection]]></title>
  <body><![CDATA[<p><strong>TITLE:</strong> UPS Delivers Optimal Phase Diagram for High Dimensional Variable Selection</p><p><strong>SPEAKER:</strong> Jiashun Jin</p><p><strong>ABSTRACT:</strong></p><p>Consider a linear regression model \begin{equation*}<br />Y = X \beta + z, \qquad z \sim N(0, I_n), \qquad X = X_{n, p},<br />\end{equation*} where both $p$ and $n$ are large but $p &gt; n$. The vector $\beta$ is unknown but is sparse in the sense that only a small proportion of its coordinates is nonzero, and we are interested in identifying these nonzero coordinates. We model the coordinates of $\beta$ as samples from a two-component mixture $(1 - \varepsilon) \nu_0 + \varepsilon \pi$, and the rows of $X$ as samples from $N(0, \frac{1}{n}\Omega)$, where $\nu_0$ is the point mass at $0$, $\pi$ is a distribution, and $\Omega$ is a $p$ by $p$ correlation matrix which is unknown but presumably sparse.</p><p>We propose a two-stage variable selection procedure which we call the <em>UPS</em>. This is a Screen and Clean procedure, in which we screen with Univariate thresholding and clean with the Penalized MLE. In many situations, the UPS possesses two important properties: Sure Screening and Separable After Screening (SAS). These properties enable us to reduce the original regression problem to many small-size regression problems that can be fitted separately. As a result, the UPS is effective both in theory and in computation.</p><p>We measure the performance of a variable selection procedure by the Hamming distance, and use an asymptotic framework where $p \to \infty$ and $(\varepsilon, \pi, n, \Omega)$ depend on $p$. We find that in many situations, the UPS achieves the optimal rate of convergence. We also find that in the $(\varepsilon_p, \pi_p)$ space, there is a three-phase diagram shared by many choices of $\Omega$. In the first phase, it is possible to recover all signals. In the second phase, exact recovery is impossible, but it is possible to recover most of the signals. In the third phase, successful variable selection is impossible. The UPS partitions the phase space in the same way that the optimal procedures do, and recovers most of the signals as long as successful variable selection is possible.</p><p>The lasso and subset selection (also known as the $L^1$- and $L^0$-penalization methods, respectively) are well-known approaches to variable selection. Somewhat surprisingly, however, there are regions in the phase space where neither the lasso nor subset selection is rate optimal, even for very simple $\Omega$. The lasso is non-optimal because it is too loose in filtering out fake signals (i.e., noise that is highly correlated with a signal), and subset selection is non-optimal because it tends to kill one or more signals in correlated pairs, triplets, etc.</p>]]></body>
  <field_summary_sentence>
    <item>
      <value><![CDATA[UPS Delivers Optimal Phase Diagram for High Dimensional Variable Selection]]></value>
    </item>
  </field_summary_sentence>
  <field_summary>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_summary>
  <field_time>
    <item>
      <value><![CDATA[2011-04-07T12:00:00-04:00]]></value>
      <value2><![CDATA[2011-04-07T13:00:00-04:00]]></value2>
      <rrule><![CDATA[]]></rrule>
      <timezone><![CDATA[America/New_York]]></timezone>
    </item>
  </field_time>
  <field_fee>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_fee>
  <field_extras>
      </field_extras>
  <field_audience>
      </field_audience>
  <field_media>
      </field_media>
  <field_contact>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_contact>
  <field_location>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_location>
  <field_sidebar>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_sidebar>
  <field_phone>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_phone>
  <field_url>
    <item>
      <url><![CDATA[]]></url>
      <title><![CDATA[]]></title>
            <attributes><![CDATA[]]></attributes>
    </item>
  </field_url>
  <field_email>
    <item>
      <email><![CDATA[]]></email>
    </item>
  </field_email>
  <field_boilerplate>
    <item>
      <nid><![CDATA[]]></nid>
    </item>
  </field_boilerplate>
  <links_related>
      </links_related>
  <files>
      </files>
  <og_groups>
          <item>1242</item>
      </og_groups>
  <og_groups_both>
          <item><![CDATA[School of Industrial and Systems Engineering (ISYE)]]></item>
      </og_groups_both>
  <field_categories>
      </field_categories>
  <field_keywords>
      </field_keywords>
  <userdata><![CDATA[]]></userdata>
</node>
