
Gradient descent: L2 norm regularization

By Emma Valentine
$\begingroup$

I've worked out the Stochastic Gradient Descent update for Logistic Regression to be approximately the following:

$ w_{t+1} = w_t - \eta\,(\sigma({w_t}^Tx_i) - y_i)\,x_i $

$p(\mathbf{y} = 1 | \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x})$, where $\sigma(t) = \frac{1}{1 + e^{-t}}$

However, I keep getting something wrong when I add L2 norm regularization.

From the HW definition of L2 Norm Regularization:

In other words, update $\mathbf{w}_t$ according to $l - \mu \|\mathbf{w}\|^2 $, where $\mathbf{\mu}$ is a constant.

I end up with something like this:

$ w_{t+1} = w_t - \eta((\sigma({w_t}^Tx_i) - y_i)x_i + 2\mu w_t) $

I know this isn't right. Where am I making a mistake?

$\endgroup$

2 Answers

$\begingroup$

Your question doesn't show which cost function you used for the calculation. If you use the MSE (mean squared error), you get the equations below.

The MSE with L2 Norm Regularization:

$$ J = \dfrac{1}{2m} \Big[\sum_{i=1}^{m}{(\sigma(w_{t}^Tx_{i}) - y_{i})^2} + \lambda \|w_{t}\|^2\Big] $$

And the update function:

$$ w_{t+1} = w_{t} - \dfrac{\gamma}{m}\Big(\sigma(w_{t}^Tx_{i}) - y_{i}\Big)x_{i} - \dfrac{\lambda}{m} w_{t} $$

Which you can simplify to:

$$ w_{t+1} = w_{t}\Big(1 - \dfrac{\lambda}{m}\Big) - \dfrac{\gamma}{m}\Big(\sigma(w_{t}^Tx_{i}) - y_{i}\Big)x_{i} $$

If you use a different cost function, you'll get a different update rule.
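The simplified form is the same update with the regularization term $\frac{\lambda}{m} w_t$ subtracted from $w_t$, so every step shrinks the weights by a factor $(1 - \frac{\lambda}{m})$ before the data term is applied. Here is a minimal sketch that checks the two forms agree numerically. The function names `update_subtract` and `update_shrink` and the toy values in the usage note are made up for illustration and are not part of the answer.

```python
import math


def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def update_subtract(w, x, y, gamma, lam, m):
    """w - (gamma/m) * (sigma(w^T x) - y) * x - (lam/m) * w"""
    err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
    return [wj - (gamma / m) * err * xj - (lam / m) * wj
            for wj, xj in zip(w, x)]


def update_shrink(w, x, y, gamma, lam, m):
    """w * (1 - lam/m) - (gamma/m) * (sigma(w^T x) - y) * x"""
    err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
    return [wj * (1.0 - lam / m) - (gamma / m) * err * xj
            for wj, xj in zip(w, x)]
```

For example, calling both functions with `w = [0.3, -0.2]`, `x = [1.0, 2.0]`, `y = 1`, `gamma = 0.5`, `lam = 0.1`, `m = 10` should return the same vector up to floating-point rounding.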

$\endgroup$

$\begingroup$

It is common to minimize the negative log-likelihood, which for one example is
$$ l(\mathbf{w}) = - \left\lbrace y \mathbf{w}^T \mathbf{x} - \log (1+\exp(\mathbf{w}^T \mathbf{x})) \right \rbrace $$
where $y\in \{0,1\}$ is the example label.

Adding a regularization term yields the cost function
$$ \phi(\mathbf{w}) = l(\mathbf{w}) + \frac12 \mu \| \mathbf{w} \|^2 $$
The gradient vector is
$$ \mathbf{g}(\mathbf{w} ) = \left[ -y + \sigma(\mathbf{w}^T \mathbf{x}) \right] \mathbf{x} + \mu \mathbf{w} $$
and the gradient descent update is
$$ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \mathbf{g}(\mathbf{w}^{(t)} ) $$
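As a rough sketch of the formulas above, the following pure-Python code implements $\phi$, $\mathbf{g}$, and one descent step for a single example, plus a finite-difference check of the gradient. The function names (`sigmoid`, `cost`, `gradient`, `sgd_step`, `numerical_gradient`) and the step size `h` are illustrative choices, not something from the homework or the question.

```python
import math


def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def cost(w, x, y, mu):
    """phi(w) = l(w) + 0.5 * mu * ||w||^2 for one example (x, y)."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    nll = -(y * z - math.log(1.0 + math.exp(z)))
    return nll + 0.5 * mu * sum(wj * wj for wj in w)


def gradient(w, x, y, mu):
    """g(w) = (sigma(w^T x) - y) * x + mu * w."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    err = sigmoid(z) - y
    return [err * xj + mu * wj for wj, xj in zip(w, x)]


def sgd_step(w, x, y, eta, mu):
    """One update w <- w - eta * g(w)."""
    g = gradient(w, x, y, mu)
    return [wj - eta * gj for wj, gj in zip(w, g)]


def numerical_gradient(w, x, y, mu, h=1e-6):
    """Central finite differences of cost(); used only to check gradient()."""
    grad = []
    for j in range(len(w)):
        up = list(w)
        down = list(w)
        up[j] += h
        down[j] -= h
        grad.append((cost(up, x, y, mu) - cost(down, x, y, mu)) / (2.0 * h))
    return grad
```

Comparing `gradient(...)` against `numerical_gradient(...)` on a few random points is a quick way to catch sign or factor mistakes in a hand-derived update. Note that if the penalty is written as $\mu \|\mathbf{w}\|^2$ without the factor $\frac12$, the last term of the gradient becomes $2\mu \mathbf{w}$ instead of $\mu \mathbf{w}$.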

$\endgroup$
