User:Lovebeloved/sandbox

Proximal gradient (forward backward splitting) methods for learning is an area of research in optimization and statistical learning theory which studies algorithms for a general class of convex regularization problems where the regularization penalty may not be differentiable. One such example is $\ell _{1}$ regularization (also known as Lasso) of the form

\min _{w\in \mathbb {R} ^{d}}{\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-\langle w,x_{i}\rangle )^{2}+\lambda \|w\|_{1},\quad {\text{ where }}x_{i}\in \mathbb {R} ^{d}{\text{ and }}y_{i}\in \mathbb {R} .

Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties that are tailored to a specific problem application.^[1]^[2] Such customized penalties can help to induce certain structure in problem solutions, such as sparsity (in the case of lasso) or group structure (in the case of group lasso).

Relevant background

Proximal gradient methods are applicable in a wide variety of scenarios for solving convex optimization problems of the form

\min _{x\in {\mathcal {H}}}F(x)+R(x),

where $F$ is convex and differentiable with Lipschitz continuous gradient, $R$ is a convex, lower semicontinuous function which is possibly nondifferentiable, and ${\mathcal {H}}$ is some set, typically a Hilbert space. In the convex, differentiable setting, the usual criterion is that $x$ minimizes $F(x)+R(x)$ if and only if $\nabla (F+R)(x)=0$; this is now replaced by

0\in \partial (F+R)(x),

where $\partial \varphi$ denotes the subdifferential of a real-valued, convex function $\varphi$ .

Given a convex function $\varphi :{\mathcal {H}}\to \mathbb {R}$ an important operator to consider is its proximity operator $\operatorname {prox} _{\varphi }:{\mathcal {H}}\to {\mathcal {H}}$ defined by

\operatorname {prox} _{\varphi }(u)=\operatorname {arg} \min _{x\in {\mathcal {H}}}\varphi (x)+{\frac {1}{2}}\|u-x\|_{2}^{2},

which is well-defined because of the strict convexity of the squared $\ell _{2}$ norm term. The proximity operator can be seen as a generalization of a projection.^[1]^[3]^[4] The proximity operator is important because $x^{*}$ is a minimizer of the problem $\min _{x\in {\mathcal {H}}}F(x)+R(x)$ if and only if

x^{*}=\operatorname {prox} _{\gamma R}\left(x^{*}-\gamma \nabla F(x^{*})\right),

where $\gamma >0$ is any positive real number.^[1]

Moreau decomposition

One important technique related to proximal gradient methods is the Moreau decomposition, which decomposes the identity operator as the sum of two proximity operators.^[1] Namely, let $\varphi :{\mathcal {X}}\to \mathbb {R}$ be a lower semicontinuous, convex function on a vector space ${\mathcal {X}}$ . We define its Fenchel conjugate $\varphi ^{*}:{\mathcal {X}}\to \mathbb {R}$ to be the function

\varphi ^{*}(u):=\sup _{x\in {\mathcal {X}}}\langle x,u\rangle -\varphi (x).

The general form of Moreau's decomposition states that for any $x\in {\mathcal {X}}$ and any $\gamma >0$,

x=\operatorname {prox} _{\gamma \varphi }(x)+\gamma \operatorname {prox} _{\varphi ^{*}/\gamma }(x/\gamma ),

which for $\gamma =1$ implies that $x=\operatorname {prox} _{\varphi }(x)+\operatorname {prox} _{\varphi ^{*}}(x)$ .^[1]^[3] The Moreau decomposition can be seen to be a generalization of the usual orthogonal decomposition of a vector space, analogous with the fact that proximity operators are generalizations of projections.^[1]
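As a numerical sanity check of the decomposition, the sketch below takes $\varphi =\|\cdot \|_{1}$ on $\mathbb {R} ^{d}$. It relies on two standard facts that are not derived at this point in the text and should be read as assumptions here: the proximity operator of $\gamma \|\cdot \|_{1}$ is entrywise soft thresholding, and the conjugate $\varphi ^{*}$ is the indicator of the unit $\ell _{\infty }$ ball, so that its proximity operator (for any positive scaling) is the projection onto that ball. The function names are illustrative only.

```python
import numpy as np

def prox_l1(x, gamma):
    """prox of gamma * ||.||_1: entrywise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def project_linf_ball(x):
    """prox of the conjugate of ||.||_1: projection onto {u : ||u||_inf <= 1}."""
    return np.clip(x, -1.0, 1.0)

def moreau_residual(x, gamma):
    """x - [prox_{gamma phi}(x) + gamma * prox_{phi*/gamma}(x / gamma)] for phi = ||.||_1."""
    return x - (prox_l1(x, gamma) + gamma * project_linf_ball(x / gamma))

x = np.array([3.0, -0.2, 0.7, -5.0, 0.0])
residual = moreau_residual(x, gamma=0.5)  # expected to vanish entrywise
```

Entries with $|x_{i}|>\gamma$ are split into a shrunk part and a part of magnitude $\gamma$, while entries with $|x_{i}|\leq \gamma$ are carried entirely by the projection term.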

In certain situations it may be easier to compute the proximity operator for the conjugate $\varphi ^{*}$ instead of the function $\varphi$ , and therefore the Moreau decomposition can be applied. This is the case for group lasso.

Lasso regularization

Consider the regularized empirical risk minimization problem with square loss and with the $\ell _{1}$ norm as the regularization penalty:

\min _{w\in \mathbb {R} ^{d}}{\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-\langle w,x_{i}\rangle )^{2}+\lambda \|w\|_{1},

where $x_{i}\in \mathbb {R} ^{d}{\text{ and }}y_{i}\in \mathbb {R} .$ The $\ell _{1}$ regularization problem is sometimes referred to as lasso (least absolute shrinkage and selection operator).^[5] Such $\ell _{1}$ regularization problems are interesting because they induce sparse solutions, that is, solutions $w$ to the minimization problem have relatively few nonzero components. Lasso can be seen to be a convex relaxation of the non-convex problem

\min _{w\in \mathbb {R} ^{d}}{\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-\langle w,x_{i}\rangle )^{2}+\lambda \|w\|_{0},

where $\|w\|_{0}$ denotes the $\ell _{0}$ "norm", which is the number of nonzero entries of the vector $w$ . Sparse solutions are of particular interest in learning theory for interpretability of results: a sparse solution can identify a small number of important factors.^[5]

Solving for $\ell _{1}$ proximity operator

For simplicity we restrict our attention to the problem where $\lambda =1$ . To solve the problem

\min _{w\in \mathbb {R} ^{d}}{\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-\langle w,x_{i}\rangle )^{2}+\|w\|_{1},

we consider our objective function in two parts: a convex, differentiable term $F(w)={\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-\langle w,x_{i}\rangle )^{2}$ and a convex function $R(w)=\|w\|_{1}$ . Note that $R$ is not strictly convex.

Let us compute the proximity operator for $R(w)$ . First we find an alternative characterization of the proximity operator $\operatorname {prox} _{R}(x)$ as follows:

${\begin{aligned}u=\operatorname {prox} _{R}(x)\iff &0\in \partial \left(R(u)+{\frac {1}{2}}\|u-x\|_{2}^{2}\right)\\\iff &0\in \partial R(u)+u-x\\\iff &x-u\in \partial R(u).\end{aligned}}$

For $R(w)=\|w\|_{1}$ it is easy to compute $\partial R(w)$ : the $i$ th entry of $\partial R(w)$ is precisely

\partial |w_{i}|={\begin{cases}1,&w_{i}>0\\-1,&w_{i}<0\\\left[-1,1\right],&w_{i}=0.\end{cases}}

Using the recharacterization of the proximity operator given above, for the choice of $R(w)=\|w\|_{1}$ and $\gamma >0$ we have that $\operatorname {prox} _{\gamma R}(x)$ is defined entrywise by

\left(\operatorname {prox} _{\gamma R}(x)\right)_{i}={\begin{cases}x_{i}-\gamma ,&x_{i}>\gamma \\0,&|x_{i}|\leq \gamma \\x_{i}+\gamma ,&x_{i}<-\gamma ,\end{cases}}

which is known as the soft thresholding operator $S_{\gamma }(x)=\operatorname {prox} _{\gamma \|\cdot \|_{1}}(x)$ .^[1]^[6]
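The operator $S_{\gamma }$ can be written in a single vectorized expression. The following is a minimal sketch using NumPy; the helper `prox_objective` is a hypothetical name introduced only to evaluate the function that the proximity operator minimizes, so that the closed form can be compared with a brute-force search.

```python
import numpy as np

def soft_threshold(x, gamma):
    """Entrywise soft thresholding S_gamma(x) = prox of gamma * ||.||_1."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_objective(u, x, gamma):
    """Value of gamma * ||u||_1 + 0.5 * ||u - x||_2^2 (illustrative helper)."""
    u = np.atleast_1d(np.asarray(u, dtype=float))
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return gamma * np.sum(np.abs(u)) + 0.5 * np.sum((u - x) ** 2)

shrunk = soft_threshold([3.0, 0.5, -2.0], gamma=1.0)

# Brute-force check in one dimension: minimize the prox objective on a grid.
grid = np.linspace(-3.0, 3.0, 6001)
values = np.array([prox_objective(u, 1.7, 1.0) for u in grid])
grid_minimizer = grid[np.argmin(values)]
```

Entries with magnitude at most $\gamma$ are set exactly to zero, which is the mechanism behind the sparsity of lasso solutions.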

Fixed point iterative schemes

To finally solve the lasso problem we consider the fixed point equation shown earlier:

x^{*}=\operatorname {prox} _{\gamma R}\left(x^{*}-\gamma \nabla F(x^{*})\right).

Given that we have computed the form of the proximity operator explicitly, we can define a standard fixed point iteration procedure. Namely, fix some initial $w^{0}\in \mathbb {R} ^{d}$ , and for $k=0,1,2,\ldots$ define

w^{k+1}=S_{\gamma }\left(w^{k}-\gamma \nabla F\left(w^{k}\right)\right).

Note here the effective trade-off between the empirical error term $F(w)$ and the regularization penalty $R(w)$ . This fixed point method has decoupled the effect of the two different convex functions which comprise the objective function into a gradient descent step ( $w^{k}-\gamma \nabla F\left(w^{k}\right)$ ) and a soft thresholding step (via $S_{\gamma }$ ).

Convergence of this fixed point scheme is well-studied in the literature^[1]^[6] and is guaranteed under appropriate choice of step size $\gamma$ and loss function (such as the square loss taken here). Accelerated methods were introduced by Nesterov in 1983 which improve the rate of convergence under certain regularity assumptions on $F$ .^[7] Such methods have been studied extensively in previous years.^[8] For more general learning problems where the proximity operator cannot be computed explicitly for some regularization term $R$ , such fixed point schemes can still be carried out using approximations to both the gradient and the proximity operator.^[4]^[9]
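The iteration can be sketched in a few lines for the square loss. The code below is an illustration under stated assumptions rather than a reference implementation: it keeps a general $\lambda$ (so the threshold is $\gamma \lambda$, which reduces to $S_{\gamma }$ when $\lambda =1$), and it picks the constant step size $\gamma =1/L$, where $L={\frac {2}{n}}\sigma _{\max }(X)^{2}$ is the Lipschitz constant of $\nabla F$ and the rows of the matrix `X` are the $x_{i}$. The synthetic data and all names are hypothetical.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def lasso_objective(X, y, w, lam):
    n = X.shape[0]
    return np.sum((y - X @ w) ** 2) / n + lam * np.sum(np.abs(w))

def ista(X, y, lam=1.0, n_iter=500):
    """Fixed point iteration w <- S_{gamma*lam}(w - gamma * grad F(w))."""
    n, d = X.shape
    # grad F(w) = (2/n) X^T (X w - y); its Lipschitz constant is (2/n) * sigma_max(X)^2.
    lipschitz = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    gamma = 1.0 / lipschitz  # assumed step-size choice
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y) / n
        w = soft_threshold(w - gamma * grad, gamma * lam)
    return w, gamma

# Synthetic example: sparse ground truth with small noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10)
w_true[[1, 4]] = [2.0, -3.0]
y = X @ w_true + 0.01 * rng.standard_normal(50)
w_hat, gamma = ista(X, y, lam=0.1)
```

At convergence, `w_hat` approximately satisfies the fixed point equation above, which gives a simple stopping test in practice.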

Practical considerations

There have been numerous developments within the past decade in convex optimization techniques which have influenced the application of proximal gradient methods in statistical learning theory. Here we survey a few important topics which can greatly improve practical algorithmic performance of these methods.^[2]^[10]

Adaptive step size

In the fixed point iteration scheme

w^{k+1}=\operatorname {prox} _{\gamma R}\left(w^{k}-\gamma \nabla F\left(w^{k}\right)\right),

one can allow variable step size $\gamma _{k}$ instead of a constant $\gamma$ . Numerous adaptive step size schemes have been proposed throughout the literature.^[1]^[4]^[11]^[12] Applications of these schemes^[2]^[13] suggest that these can offer substantial improvement in number of iterations required for fixed point convergence.
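The text does not single out a particular step size rule. As one illustration, the sketch below uses a standard backtracking rule: $\gamma _{k}$ is shrunk until the quadratic upper bound $F(z)\leq F(w)+\langle \nabla F(w),z-w\rangle +{\frac {1}{2\gamma }}\|z-w\|_{2}^{2}$ holds at the candidate point $z$. This is an assumed choice and not necessarily one of the schemes in the cited works; the function and parameter names (`gamma0`, `shrink`) are hypothetical.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_grad_backtracking(F, grad_F, prox_R, w0, gamma0=1.0, shrink=0.5, n_iter=200):
    """Proximal gradient with a backtracking (non-increasing) step size.

    prox_R(v, gamma) must return prox_{gamma R}(v).
    """
    w = np.asarray(w0, dtype=float)
    gamma = gamma0
    steps = []
    for _ in range(n_iter):
        g = grad_F(w)
        while True:
            z = prox_R(w - gamma * g, gamma)
            diff = z - w
            # Accept gamma once the quadratic model upper-bounds F at z.
            if F(z) <= F(w) + g @ diff + diff @ diff / (2.0 * gamma) + 1e-12:
                break
            gamma *= shrink
        w = z
        steps.append(gamma)
    return w, steps

# Lasso example with synthetic data (lambda = 0.1).
rng = np.random.default_rng(1)
n, d, lam = 30, 5, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
F = lambda w: np.sum((y - X @ w) ** 2) / n
grad_F = lambda w: 2.0 * X.T @ (X @ w - y) / n
prox_R = lambda v, gamma: soft_threshold(v, gamma * lam)
objective = lambda w: F(w) + lam * np.sum(np.abs(w))

w0 = np.zeros(d)
w_hat, steps = prox_grad_backtracking(F, grad_F, prox_R, w0)
```

The rule needs no knowledge of the Lipschitz constant of $\nabla F$, at the cost of extra evaluations of $F$ in the inner loop.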

Elastic net (mixed norm regularization)

Elastic net regularization offers an alternative to pure $\ell _{1}$ regularization. The problem of lasso ( $\ell _{1}$ ) regularization involves the penalty term $R(w)=\|w\|_{1}$ , which is not strictly convex. Hence, solutions to $\min _{w}F(w)+R(w),$ where $F$ is some empirical loss function, need not be unique. This is often avoided by the inclusion of an additional strictly convex term, such as an $\ell _{2}$ norm regularization penalty. For example, one can consider the problem

\min _{w\in \mathbb {R} ^{d}}{\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-\langle w,x_{i}\rangle )^{2}+\lambda \left((1-\mu )\|w\|_{1}+\mu \|w\|_{2}^{2}\right),

where $x_{i}\in \mathbb {R} ^{d}{\text{ and }}y_{i}\in \mathbb {R} .$ For $0<\mu \leq 1$ the penalty term $\lambda \left((1-\mu )\|w\|_{1}+\mu \|w\|_{2}^{2}\right)$ is now strictly convex, and hence the minimization problem now admits a unique solution. It has been observed that for sufficiently small $\mu >0$ , the additional penalty term $\mu \|w\|_{2}^{2}$ acts as a preconditioner and can substantially improve convergence while not adversely affecting the sparsity of solutions.^[2]^[14]
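One way to handle the mixed penalty inside the fixed point scheme is through its proximity operator. The sketch below assumes that the $\ell _{2}$ term enters squared, that is, the penalty is $\lambda \left((1-\mu )\|w\|_{1}+\mu \|w\|_{2}^{2}\right)$; under that assumption, setting the subgradient of the prox objective to zero entrywise gives soft thresholding at level $\gamma \lambda (1-\mu )$ followed by a rescaling by $1/(1+2\gamma \lambda \mu )$. The function name is hypothetical.

```python
import numpy as np

def prox_elastic_net(x, gamma, lam, mu):
    """prox of gamma * lam * ((1 - mu) * ||.||_1 + mu * ||.||_2^2) (squared l2 term assumed)."""
    x = np.asarray(x, dtype=float)
    shrunk = np.sign(x) * np.maximum(np.abs(x) - gamma * lam * (1.0 - mu), 0.0)
    return shrunk / (1.0 + 2.0 * gamma * lam * mu)

# One-dimensional check against a brute-force grid search.
x0, gamma, lam, mu = 2.0, 1.0, 1.0, 0.5
grid = np.linspace(-3.0, 3.0, 6001)
values = gamma * lam * ((1 - mu) * np.abs(grid) + mu * grid ** 2) + 0.5 * (grid - x0) ** 2
grid_minimizer = grid[np.argmin(values)]
closed_form = prox_elastic_net(x0, gamma, lam, mu)
```

For $\mu =0$ the operator reduces to plain soft thresholding, and small entries are still mapped exactly to zero for any $\mu <1$.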

Exploiting group structure

Proximal gradient methods provide a general framework which is applicable to a wide variety of problems in statistical learning theory. Certain problems in learning can often involve data which has additional structure that is known a priori. In the past several years there have been new developments which incorporate information about group structure to provide methods which are tailored to different applications. Here we survey a few such methods.

Group lasso

Group lasso is a generalization of the lasso method when features are grouped into disjoint blocks.^[15] Suppose the features are grouped into blocks $\{w_{1},\ldots ,w_{G}\}$ . Here we take as a regularization penalty

R(w)=\sum _{g=1}^{G}\|w_{g}\|_{2},

which is the sum of the $\ell _{2}$ norms of the corresponding feature vectors for the different groups. A proximity operator analysis similar to the one above can be used to compute the proximity operator for this penalty. Whereas the lasso penalty has a proximity operator which is soft thresholding on each individual component, the proximity operator for the group lasso is soft thresholding on each group. For the group $w_{g}$ , the proximity operator of $\lambda \gamma \left(\sum _{g=1}^{G}\|w_{g}\|_{2}\right)$ is given by

{\widetilde {S}}_{\lambda \gamma }(w_{g})={\begin{cases}w_{g}-\lambda \gamma {\frac {w_{g}}{\|w_{g}\|_{2}}},&\|w_{g}\|_{2}>\lambda \gamma \\0,&\|w_{g}\|_{2}\leq \lambda \gamma \end{cases}}

where $w_{g}$ is the $g$ th group.
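The group soft thresholding operator above acts block by block. A minimal sketch follows, assuming the disjoint groups are given as lists of coordinate indices; this representation and the names are illustrative, and `tau` plays the role of $\lambda \gamma$.

```python
import numpy as np

def group_soft_threshold(w, groups, tau):
    """Apply block soft thresholding with level tau to each disjoint group of w."""
    w = np.asarray(w, dtype=float)
    out = np.zeros_like(w)
    for idx in groups:
        block = w[idx]
        norm = np.linalg.norm(block)
        if norm > tau:
            # Shrink the block towards zero along its own direction.
            out[idx] = (1.0 - tau / norm) * block
        # Otherwise the whole block stays at zero.
    return out

w = np.array([3.0, 4.0, 0.1, -0.2])
groups = [[0, 1], [2, 3]]
result = group_soft_threshold(w, groups, tau=1.0)
```

Here the first block has norm 5 and is scaled by $1-1/5$, while the second block has norm below the threshold and is removed entirely, so sparsity is induced at the level of groups.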

In contrast to lasso, the derivation of the proximity operator for group lasso relies on the Moreau decomposition. Here the proximity operator of the conjugate of the group lasso penalty becomes a projection onto the ball of a dual norm.^[2]

Other group structures

In contrast to the group lasso problem, where features are grouped into disjoint blocks, it may be the case that grouped features are overlapping or have a nested structure. Such generalizations of group lasso have been considered in a variety of contexts.^[16]^[17]^[18]^[19] For overlapping groups one common approach is known as latent group lasso which introduces latent variables to account for overlap.^[20]^[21] Nested group structures are studied in hierarchical structure prediction and with directed acyclic graphs.^[18]

Method for learning matrices

The proximal operator can be extended from functions of vectors to functions of matrices when the object to be learned is a matrix instead of a vector.

Learning matrices

A general framework for this setting is the following: consider a training set $S=(X_{i}^{t},y_{i}^{t})$ where $X_{i}^{t},W\in \mathbb {R} ^{D\times T}$ and $y_{i}^{t},\epsilon _{i}^{t}\in \mathbb {R}$ for $i=1,...,n_{t}$ and $t=1,...,T$ , and assume the regression model

y_{i}^{t}=\langle W,X_{i}^{t}\rangle _{F}+\epsilon _{i}^{t}

where $\langle W,X_{i}^{t}\rangle _{F}=\operatorname {Tr} (W'X_{i}^{t})$ is the Frobenius (Hilbert-Schmidt) inner product. Typical examples of learning matrices are the following special cases of the above general model.

Let $(e_{t})_{t}$ be the canonical basis in $\mathbb {R} ^{T}$ .

Linear multi-task learning: the choice $X_{i}^{t}=e_{t}\otimes x_{i}^{t}$ gives $y_{i,t}=\langle W^{t},x_{i}^{t}\rangle _{\mathbb {R} ^{D}}+\epsilon _{i,t}$ , where $W^{t}\in \mathbb {R} ^{D}$ is the regression vector for $x_{i}^{t}\in \mathbb {R} ^{D}$ .

Linear multivariate regression: the choice $X_{i}^{t}=e_{t}\otimes x_{i}$ gives $y_{i}=W'x_{i}+\epsilon _{i}$ , where $y_{i},\epsilon _{i}\in \mathbb {R} ^{T}$ and $x_{i}\in \mathbb {R} ^{D}$ .

Matrix completion: the choice $X_{i}^{t}=e_{t}\otimes e_{i}'$ gives $y_{i,t}=W_{i,t}+\epsilon _{i,t}$ , where $(e_{i}')_{i}$ is the canonical basis in $\mathbb {R} ^{D}$ .
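The special cases can be checked numerically. The sketch below assumes that the tensor product $e_{t}\otimes x$ is laid out as the $D\times T$ matrix $x\,e_{t}'$ (the vector $x$ placed in column $t$), which is one convention consistent with $W\in \mathbb {R} ^{D\times T}$; indices are zero-based in the code and all names are illustrative.

```python
import numpy as np

def frobenius_inner(W, X):
    """Frobenius inner product <W, X>_F = Tr(W' X)."""
    return np.trace(W.T @ X)

D, T = 4, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((D, T))
x = rng.standard_normal(D)

t, i = 1, 2                      # task index and row index (zero-based)
e_t = np.zeros(T)
e_t[t] = 1.0
e_i = np.zeros(D)
e_i[i] = 1.0

X_multitask = np.outer(x, e_t)   # x placed in column t
X_completion = np.outer(e_i, e_t)  # single 1 at entry (i, t)

multitask_value = frobenius_inner(W, X_multitask)    # equals <W^t, x>
completion_value = frobenius_inner(W, X_completion)  # equals W[i, t]
```

In the multi-task case the inner product picks out column $t$ of $W$ as the regression vector $W^{t}$ , and in matrix completion it picks out a single entry.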

The corresponding penalized regularization scheme for the model is:

\min _{W\in {\mathcal {H}}}{\hat {\mathcal {E}}}(W)+R(W)

where ${\mathcal {H}}$ is the space of matrices equipped with the Hilbert-Schmidt (Frobenius) inner product, ${\hat {\mathcal {E}}}$ is an empirical error defined by a loss function $V:\mathbb {R} \times \mathbb {R} \rightarrow [0,\infty )$ (e.g., the square loss function), and $R(W)$ is a regularizer. As in the vector version of the problem, we can consider $R(W)=\lambda \|W\|_{p}$ .

Proximal operator for matrix norm

Different matrix norms lead to different proximal operators. We discuss the proximal operators associated with two types of matrix norm below, and for simplicity, we assume $\lambda =1$ .

Entrywise norm

An entrywise $p$ -norm is defined as:

\|W\|_{p}=\left(\sum _{i=1}^{D}\sum _{j=1}^{T}|W_{i,j}|^{p}\right)^{\frac {1}{p}}

Since it simply treats the matrix $W\in \mathbb {R} ^{D\times T}$ as a vector in $\mathbb {R} ^{DT}$ , the proximal operator is the same as that of a vector function. For example, when $p=1$ , the proximal operator is entrywise soft thresholding.

Schatten norm

The Schatten $p$ -norm is defined as:

\|W\|_{p}=\left(\sum _{j=1}^{m}|\sigma _{j}|^{p}\right)^{\frac {1}{p}}

where $\sigma _{1},...,\sigma _{m}\geq 0,m=\min(T,D)$ are the singular values of $W$ .

Let $\sigma :\mathbb {R} ^{D\times T}\rightarrow \mathbb {R} ^{m}$ be the singular value map, the function that takes a matrix and returns the vector of its singular values in nonincreasing order. Then

W=U{\textbf {diag}}(\sigma (W))V

where $U\in \mathbb {R} ^{D\times D}$ and $V\in \mathbb {R} ^{T\times T}$ are unitary matrices, and ${\textbf {diag}}(\sigma (W))\in \mathbb {R} ^{D\times T}$ .

The Schatten $p$ -norm is unitarily invariant, so that $\|W\|_{p}=\|{\textbf {diag}}(\sigma (W))\|_{p}$ .

It can be further shown^[22] that the subdifferential of $R(W)$ is:

\partial (\|W\|_{p})=U{\textbf {diag}}[\partial (\|\sigma (W)\|_{p})]V

This implies that:

{\textbf {prox}}_{R}(W)=U{\textbf {diag}}[{\textbf {prox}}_{\|.\|_{p}}(\sigma (W))]V

For example, when $p=1$ (nuclear norm), the proximal operator is singular value thresholding:

{\textbf {prox}}_{R}(W)=\sum _{i=1}^{m}(\sigma _{i}-1)_{+}u_{i}v_{i}^{T}

where $(\sigma _{i}-1)_{+}=\max(\sigma _{i}-1,0)$ is the soft thresholding operator applied to $\sigma _{i}$ , $u_{i}$ is the $i$ th column of $U$ , and $v_{i}^{T}$ is the $i$ th row of $V$ .
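Singular value thresholding can be sketched directly from a singular value decomposition. The code below uses `numpy.linalg.svd`, which returns factors `U`, `s`, `Vh` with $W=U\,\operatorname {diag} (s)\,V_{h}$ , so the rank-one terms are built from columns of `U` and rows of `Vh`. The threshold `tau` generalizes the level 1 used above (that is, $\lambda =1$ ); the function name is illustrative.

```python
import numpy as np

def singular_value_threshold(W, tau=1.0):
    """prox of tau * (nuclear norm): soft-threshold the singular values of W."""
    U, s, Vh = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    # (U * s_shrunk) scales column j of U by s_shrunk[j].
    return (U * s_shrunk) @ Vh

W_diag = np.diag([3.0, 0.5])
thresholded = singular_value_threshold(W_diag, tau=1.0)

rng = np.random.default_rng(0)
W_rand = rng.standard_normal((5, 3))
s_before = np.linalg.svd(W_rand, compute_uv=False)
s_after = np.linalg.svd(singular_value_threshold(W_rand, 0.8), compute_uv=False)
```

Singular values below the threshold are set to zero, so the operator lowers the rank of its argument in the same way that entrywise soft thresholding induces sparsity.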

Proximal gradient method for learning matrices

The above results suggest an iterative proximal gradient method for learning matrices:

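W_{k+1}=\operatorname {prox} _{\gamma R}\left(W_{k}-\gamma \nabla {\hat {\mathcal {E}}}(W_{k})\right),\quad k=0,1,\ldots ,

for a given initialization $W_{0}$ . Here $k$ indexes iterations (not tasks) and $\gamma >0$ is the step size, as in the vector case.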
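As an illustration of this iteration, the sketch below applies it to matrix completion with a nuclear norm penalty. Several choices are assumptions made for the example and are not fixed by the text: the empirical error is taken to be the unnormalized squared error over observed entries, ${\hat {\mathcal {E}}}(W)=\sum _{(i,t){\text{ observed}}}(W_{i,t}-y_{i,t})^{2}$ , whose gradient has Lipschitz constant 2, so the step size is set to $1/2$ ; the observation pattern is encoded by a 0/1 array `mask`; and all names and the synthetic data are hypothetical.

```python
import numpy as np

def svt(W, tau):
    """prox of tau * (nuclear norm): singular value thresholding."""
    U, s, Vh = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vh

def objective(W, Y, mask, lam):
    nuclear = np.sum(np.linalg.svd(W, compute_uv=False))
    return np.sum(mask * (W - Y) ** 2) + lam * nuclear

def complete_matrix(Y, mask, lam, n_iter=300):
    """Proximal gradient iteration W <- prox_{gamma R}(W - gamma * grad E(W))."""
    gamma = 0.5  # 1 / L with L = 2 for the masked squared error (assumed loss)
    W = np.zeros_like(Y)
    history = []
    for _ in range(n_iter):
        grad = 2.0 * mask * (W - Y)
        W = svt(W - gamma * grad, gamma * lam)
        history.append(objective(W, Y, mask, lam))
    return W, history

# Synthetic rank-one matrix with roughly 70% of entries observed.
rng = np.random.default_rng(0)
Y_full = np.outer(rng.standard_normal(8), rng.standard_normal(6))
mask = (rng.random((8, 6)) < 0.7).astype(float)
Y = mask * Y_full
lam = 0.1
W_hat, history = complete_matrix(Y, mask, lam)
```

With this step size the objective values in `history` do not increase from one iteration to the next, which is a convenient check when experimenting with other losses or penalties.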

References

[1] Combettes, Patrick L.; Wajs, Valérie R. (2005). "Signal Recovering by Proximal Forward-Backward Splitting". Multiscale Model. Simul. 4 (4): 1168–1200. doi:10.1137/050626090.
[2] Mosci, S.; Rosasco, L.; Matteo, S.; Verri, A.; Villa, S. (2010). "Solving Structured Sparsity Regularization with Proximal Methods". Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science. 6322: 418–433. doi:10.1007/978-3-642-15883-4_27. ISBN 978-3-642-15882-7.
[3] Moreau, J.-J. (1962). "Fonctions convexes duales et points proximaux dans un espace hilbertien". C. R. Acad. Sci. Paris Ser. A Math. 255: 2897–2899.
[4] Bauschke, H.H.; Combettes, P.L. (2011). Convex analysis and monotone operator theory in Hilbert spaces. Springer.
[5] Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso". J. R. Stat. Soc., Ser. B. 58 (1): 267–288.
[6] Daubechies, I.; Defrise, M.; De Mol, C. (2004). "An iterative thresholding algorithm for linear inverse problem with a sparsity constraint". Comm. Pure Appl. Math. 57 (11): 1413–1457. doi:10.1002/cpa.20042.
[7] Nesterov, Yurii (1983). "A method of solving a convex programming problem with convergence rate $O(1/k^{2})$ ". Soviet Math. Doklady. 27 (2): 372–376.
[8] Nesterov, Yurii (2004). Introductory Lectures on Convex Optimization. Kluwer Academic Publisher.
[9] Villa, S.; Salzo, S.; Baldassarre, L.; Verri, A. (2013). "Accelerated and inexact forward-backward algorithms". SIAM J. Optim. 23 (3): 1607–1633. doi:10.1137/110844805.
[10] Bach, F.; Jenatton, R.; Mairal, J.; Obozinski, Gl. (2011). "Optimization with sparsity-inducing penalties". Found. & Trends Mach. Learn. 4 (1): 1–106. doi:10.1561/2200000015.
[11] Loris, I. (2009). "Accelerating gradient projection methods for $\ell _{1}$ -constrained signal recovery by steplength selection rules". Applied & Comp. Harmonic Analysis. 27 (2): 247–254. doi:10.1016/j.acha.2009.02.003.
[12] Wright, S.J.; Nowak, R.D.; Figueiredo, M.A.T. (2009). "Sparse reconstruction by separable approximation". IEEE Trans. Signal Process. 57 (7): 2479–2493. doi:10.1109/TSP.2009.2016892.
[13] Loris, Ignace (2009). "On the performance of algorithms for the minimization of $\ell _{1}$ -penalized functionals". Inverse Problems. 25 (3). doi:10.1088/0266-5611/25/3/035008.
[14] De Mol, Christine; De Vito, Ernesto; Rosasco, Lorenzo (2009). "Elastic-net regularization in learning theory". J. Complexity. 25 (2): 201–230. doi:10.1016/j.jco.2009.01.002.
[15] Yuan, M.; Lin, Y. (2006). "Model selection and estimation in regression with grouped variables". J. R. Stat. Soc. B. 68 (1): 49–67. doi:10.1111/j.1467-9868.2005.00532.x.
[16] Chen, X.; Lin, Q.; Kim, S.; Carbonell, J.G.; Xing, E.P. (2012). "Smoothing proximal gradient method for general structured sparse regression". Ann. Appl. Stat. 6 (2): 719–752. doi:10.1214/11-AOAS514.
[17] Mosci, S.; Villa, S.; Verri, A.; Rosasco, L. (2010). "A primal-dual algorithm for group sparse regularization with overlapping groups". NIPS. 23: 2604–2612.
[18] Jenatton, R. (2011). "Structured variable selection with sparsity-inducing norms". J. Mach. Learn. Res. 12: 2777–2824.
[19] Zhao, P.; Rocha, G.; Yu, B. (2009). "The composite absolute penalties family for grouped and hierarchical variable selection". Ann. Statist. 37 (6A): 3468–3497. doi:10.1214/07-AOS584.
[20] Obozinski, G. (2011). "Group lasso with overlaps: the latent group lasso approach". INRIA Technical Report.
[21] Villa, S.; Rosasco, L.; Mosci, S.; Verri, A. (2012). "Proximal methods for the latent group lasso penalty". Preprint. arXiv:1209.0368.
[22] Lewis, A.S. (1995). "The convex analysis of unitary invariant matrix functions". Journal of Convex Analysis. 2 (1/2): 173–183.
