Derivative of the cost function for logistic regression
I am going over the lectures on Machine Learning at Coursera.
I am struggling with the following. How can the partial derivative of
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))$$
where $h_{\theta}(x)$ is defined as follows
$$h_{\theta}(x)=g(\theta^{T}x)$$$$g(z)=\frac{1}{1+e^{-z}}$$
be $$ \frac{\partial}{\partial\theta_{j}}J(\theta) =\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{i})-y^i)x_j^i$$
In other words, how would we go about calculating the partial derivative with respect to $\theta$ of the cost function (the logs are natural logarithms):
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))$$
$\endgroup$
$\begingroup$The reason is the following. We use the notation:
$$\theta x^i:=\theta_0+\theta_1 x^i_1+\dots+\theta_p x^i_p.$$
Then
$$\log h_\theta(x^i)=\log\frac{1}{1+e^{-\theta x^i} }=-\log ( 1+e^{-\theta x^i} ),$$ $$\log(1- h_\theta(x^i))=\log(1-\frac{1}{1+e^{-\theta x^i} })=\log (e^{-\theta x^i} )-\log ( 1+e^{-\theta x^i} )=-\theta x^i-\log ( 1+e^{-\theta x^i} ),$$ [ this used: $ 1 = \frac{(1+e^{-\theta x^i})}{(1+e^{-\theta x^i})},$ the 1's in numerator cancel, then we used: $\log(x/y) = \log(x) - \log(y)$]
Since our original cost function has the form
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))$$
plugging in the two simplified expressions above, we obtain $$J(\theta)=-\frac{1}{m}\sum_{i=1}^m \left[-y^i\log ( 1+e^{-\theta x^i}) + (1-y^i)(-\theta x^i-\log ( 1+e^{-\theta x^i} ))\right],$$ which can be simplified to $$J(\theta)=-\frac{1}{m}\sum_{i=1}^m \left[y^i\theta x^i-\theta x^i-\log(1+e^{-\theta x^i})\right]=-\frac{1}{m}\sum_{i=1}^m \left[y^i\theta x^i-\log(1+e^{\theta x^i})\right],~~(*)$$
where the second equality follows from
$$-\theta x^i-\log(1+e^{-\theta x^i})= -\left[ \log e^{\theta x^i}+ \log(1+e^{-\theta x^i} ) \right]=-\log(1+e^{\theta x^i}). $$ [ we used $ \log(x) + \log(y) = \log(x y) $ ]
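All you need now is to compute the partial derivatives of $(*)$ w.r.t. $\theta_j$. Since $$\frac{\partial}{\partial \theta_j}y^i\theta x^i=y^ix^i_j, $$$$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i})=\frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}}=x^i_jh_\theta(x^i),$$
the result follows: $$\frac{\partial}{\partial\theta_j}J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^ix^i_j-x^i_jh_\theta(x^i)\right]=\frac{1}{m}\sum_{i=1}^m(h_\theta(x^i)-y^i)x^i_j.$$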
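As a sanity check on the simplification to $(*)$, here is a minimal Python sketch that evaluates both the original cost and the simplified form on a small data set. The data, the parameter values, and the function names (`cost_original`, `cost_simplified`) are made up for illustration; the first feature is fixed to 1 so that it plays the role of the $\theta_0$ term.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def cost_original(theta, X, y):
    # J = -(1/m) * sum[ y log h + (1 - y) log(1 - h) ]
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(dot(theta, x_i))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / len(X)

def cost_simplified(theta, X, y):
    # (*): J = -(1/m) * sum[ y * (theta x) - log(1 + e^{theta x}) ]
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = dot(theta, x_i)
        total += y_i * z - math.log(1 + math.exp(z))
    return -total / len(X)

# Hypothetical toy data: each row is (1, x_1, x_2)
X = [[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.1], [1.0, -2.0, -0.4]]
y = [1, 0, 1, 0]
theta = [0.1, -0.4, 0.7]

diff = abs(cost_original(theta, X, y) - cost_simplified(theta, X, y))
print(diff)  # expected to be at floating-point rounding level
```

The two functions should agree up to rounding error for any choice of `theta`, which is what the algebra above claims.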
$\endgroup$ $\begingroup$You have to take the partial derivative with respect to $\theta_j$. Remember that the hypothesis function here is the sigmoid function applied to $\theta^T x$, so it is a function of $\theta$; in other words, we need to apply the chain rule. This is my approach:
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))$$
$$\frac{\partial}{\partial\theta_{j}}J(\theta) = \frac{\partial}{\partial\theta_{j}} [-\frac{1}{m}\sum_{i=1}^{m}y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i})) ]$$
Anything without $\theta$ is treated as constant:
$$ \tag{1} \frac{\partial}{\partial\theta_{j}}J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{i}\frac{\partial}{\partial\theta_{j}}[\log(h_\theta(x^{i}))]+(1-y^{i})\frac{\partial}{\partial\theta_{j}}[\log(1-h_\theta(x^{i})) ]\right]$$
Let's solve each derivative separately and then plug back in on (1):
$$\tag{2} \frac{\partial}{\partial\theta_{j}}[\log(h_\theta(x^{i}))] = \frac{1}{h_\theta(x^{i})} \frac{\partial}{\partial\theta_{j}} h_\theta(x^{i})$$
$$ \tag{3} \frac{\partial}{\partial\theta_{j}}[\log(1 - h_\theta(x^{i}))] = \frac{1}{1 - h_\theta(x^{i})} \frac{\partial}{\partial\theta_{j}} (1 -h_\theta(x^{i})) = \frac{-1}{1 - h_\theta(x^{i})} \frac{\partial}{\partial\theta_{j}} h_\theta(x^{i}) $$
Plug (3) and (2) in (1):
$$ \frac{\partial}{\partial\theta_{j}}J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} [ y^{i} \frac{1}{h_\theta(x^{i})}\frac{\partial}{\partial\theta_{j}} h_\theta(x^{i}) +(1-y^{i}) \frac{-1}{1 - h_\theta(x^{i})} \frac{\partial}{\partial\theta_{j}} h_\theta(x^{i}) ]$$
$$\tag{4} \frac{\partial}{\partial\theta_{j}}J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} [ \frac{y^{i}}{h_\theta(x^{i})} - \frac{(1-y^{i})}{1 - h_\theta(x^{i})} ] * \frac{\partial}{\partial\theta_{j}} h_\theta(x^{i})$$
Notice that using the chain rule, the derivative of the hypothesis function can be understood as $$\tag{5}\frac{\partial}{\partial\theta_{j}}[\ h_\theta(x^{i})] = \frac{\partial}{\partial z }[\ h(z)] * \frac{\partial}{\partial\theta_{j}}[\ z(\theta)] = [h(z) * [1 - h(z) ]] *[x_j^i] $$
where
$$ \frac{\partial}{\partial z }[\ h(z)] = \frac{\partial}{\partial z } \frac{1}{1+e^{-z}} = \frac{0 - (1)*(1+e^{-z})'}{(1+e^{-z})^2} = \frac{ (e^{-z})}{(1+e^{-z})^2} = [\frac{1}{(1+e^{-z})}] * [\frac{ (e^{-z})}{(1+e^{-z})}] = [\frac{1}{(1+e^{-z})}] * [1 -\frac{1}{(1+e^{-z})}] = h(z) * [1 - h(z) ] $$and $$\frac{\partial}{\partial\theta_{j}}[\ z(\theta)] = \frac{\partial}{\partial\theta_{j}}[\ \theta x^i] = x_j^i $$
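The identity for the derivative of the sigmoid can be checked numerically. The sketch below compares $h(z)\,[1-h(z)]$ with a central finite difference at a few arbitrarily chosen points; the helper names and the step size `eps` are assumptions made for this illustration.

```python
import math

def g(z):
    # Sigmoid / logistic function
    return 1.0 / (1.0 + math.exp(-z))

def g_prime_analytic(z):
    # Closed form derived above: g'(z) = g(z) * (1 - g(z))
    return g(z) * (1.0 - g(z))

def g_prime_numeric(z, eps=1e-6):
    # Central finite-difference approximation of g'(z)
    return (g(z + eps) - g(z - eps)) / (2 * eps)

points = [-3.0, -1.0, 0.0, 0.5, 2.0]  # arbitrary test points
max_err = max(abs(g_prime_analytic(z) - g_prime_numeric(z)) for z in points)
print(max_err)  # should be tiny compared with the derivative values
```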
Plug (5) in (4):
$$ \frac{\partial}{\partial\theta_{j}}J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} [ \frac{y^{i}}{h_\theta(x^{i})} - \frac{(1-y^{i})}{1 - h_\theta(x^{i})} ] * [ h_\theta(x^{i}) * ( 1 -h_\theta(x^{i})) * x_j^i ]$$
Multiplying out the product and simplifying the subtraction gives:
$$\frac{\partial}{\partial\theta_{j}}J(\theta) =\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{i})-y^i)x_j^i$$
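For readers who want the algebra of this last step spelled out, here is a short worked version, writing $h$ for $h_\theta(x^{i})$ and $y$ for $y^{i}$ (shorthand introduced only for this step):

$$\left[\frac{y}{h}-\frac{1-y}{1-h}\right]h(1-h)=y(1-h)-(1-y)h=y-yh-h+yh=y-h.$$

The leading minus sign in front of $\frac{1}{m}\sum$ then turns $(y-h)\,x_j^i$ into $(h-y)\,x_j^i$.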
There is a $1/m$ factor missing on your expected answer.
Hope this helps.
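One way to gain confidence in the final formula is a finite-difference gradient check. The following is a minimal sketch under made-up assumptions: the toy data, the parameter vector, and the function names are hypothetical and only serve to compare the analytic gradient with a numerical one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum[ y log h + (1 - y) log(1 - h) ]
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(dot(theta, x_i))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / len(X)

def grad_analytic(theta, X, y):
    # dJ/dtheta_j = (1/m) * sum (h - y) * x_j
    m = len(X)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        err = sigmoid(dot(theta, x_i)) - y_i
        for j, x_ij in enumerate(x_i):
            grad[j] += err * x_ij / m
    return grad

def grad_numeric(theta, X, y, eps=1e-6):
    # Central differences, one coordinate of theta at a time
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((cost(up, X, y) - cost(down, X, y)) / (2 * eps))
    return grad

# Hypothetical toy data: each row is (1, x_1, x_2)
X = [[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.1], [1.0, -2.0, -0.4]]
y = [1, 0, 1, 0]
theta = [0.1, -0.4, 0.7]

ga = grad_analytic(theta, X, y)
gn = grad_numeric(theta, X, y)
max_diff = max(abs(a - b) for a, b in zip(ga, gn))
print(max_diff)  # should be small if the derivation is right
```

If a sign or the $1/m$ factor were wrong, the two gradients would disagree by far more than rounding error.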
$\endgroup$ $\begingroup$@pedro-lopes, this is called the chain rule: $$(u(v))' = u'(v) * v'$$For example:$$y = \sin(3x - 5)$$$$u(v) = \sin(v)$$$$v = 3x - 5$$$$y' = \sin(3x - 5)' = \cos(3x - 5) * (3 - 0) = 3\cos(3x-5)$$
Regarding: $$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i})=\frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}}$$$$u(v) = \log(v)$$$$v = 1+e^{\theta x^i}$$$$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i}) = \frac{\partial}{\partial v}\log(v) * \frac{\partial}{\partial \theta_j}(1+e^{\theta x^i}) = \frac{1}{1+e^{\theta x^i}} * (0 + x^i_je^{\theta x^i}) = \frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}} $$Note that $$\log(x)' = \frac{1}{x}$$Hope that answers your question!
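Both chain-rule results can be spot-checked with a numerical derivative. This is only an illustrative sketch: the evaluation point `x0`, the sample `x_vec`, the parameters `theta`, and the index `j` are arbitrary values chosen for the example.

```python
import math

def numeric_derivative(f, x, eps=1e-6):
    # Central finite-difference approximation of f'(x)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Example 1: y = sin(3x - 5), so y' = 3 cos(3x - 5)
x0 = 0.7
err_sin = abs(3 * math.cos(3 * x0 - 5)
              - numeric_derivative(lambda x: math.sin(3 * x - 5), x0))

# Example 2: d/dtheta_j log(1 + e^{theta x}) = x_j e^{theta x} / (1 + e^{theta x})
x_vec = [1.0, 0.5, -1.2]   # hypothetical sample, with x_0 = 1
theta = [0.1, -0.4, 0.7]   # hypothetical parameters
j = 2                      # coordinate being differentiated

def log_term(theta_j):
    # log(1 + e^{theta x}) as a function of theta_j only
    t = list(theta)
    t[j] = theta_j
    z = sum(a * b for a, b in zip(t, x_vec))
    return math.log(1 + math.exp(z))

z0 = sum(a * b for a, b in zip(theta, x_vec))
analytic = x_vec[j] * math.exp(z0) / (1 + math.exp(z0))
err_log = abs(analytic - numeric_derivative(log_term, theta[j]))

print(err_sin, err_log)  # both should be close to zero
```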
$\endgroup$ $\begingroup$We have, \begin{align*} L(\theta) &= -\frac{1}{m}\sum\limits_{i=1}^{m}{y_i \cdot \log P(y_i|x_i,\theta) + (1-y_i) \cdot \log{(1 - P(y_i|x_i,\theta))}} \\ h_\theta(x_i) &= P(y_i|x_i,\theta) = P(y_i=1|x_i,\theta) = \frac{1}{1+\exp{\left(-\sum\limits_k \theta_k x_i^k \right)}} \end{align*}
Then, \begin{align*} \log{(P(y_i|x_i,\theta))}=\log{(P(y_i=1|x_i,\theta))} &=-\log{\left(1+\exp{\left(-\sum\limits_k \theta_k x_i^k \right)} \right)} \\ \Rightarrow \frac{\partial }{\partial \theta_j} \log P(y_i|x_i,\theta) =\frac{x_i^j \cdot \exp{\left(-\sum\limits_k \theta_k x_i^k\right)}}{1+\exp{\left(-\sum\limits_k \theta_k x_i^k\right)}} &= x_i^j \cdot \left(1-P(y_i|x_i,\theta)\right) \end{align*} and \begin{align*} \log{(1-P(y_i|x_i,\theta))}=\log{(1-P(y_i=1|x_i,\theta))} &=-\sum\limits_k \theta_k x_i^k -\log{\left(1+\exp{\left(-\sum\limits_k \theta_k x_i^k \right)} \right)} \\ \Rightarrow \frac{\partial }{\partial \theta_j} \log{(1 - P(y_i|x_i,\theta))} &= -x_i^j + x_i^j \cdot \left(1-P(y_i|x_i,\theta)\right) = -x_i^j \cdot P(y_i|x_i,\theta) \\ \end{align*}
Hence,
\begin{align*} \frac{\partial }{\partial \theta_j} L(\theta) &= -\frac{1}{m}\sum\limits_{i=1}^{m}{y_i \cdot \frac{\partial }{\partial \theta_j} \log P(y_i|x_i,\theta) + (1-y_i) \cdot \frac{\partial }{\partial \theta_j} \log{(1 - P(y_i|x_i,\theta))}} \\ &=-\frac{1}{m}\sum\limits_{i=1}^{m}{y_i \cdot x_i^j \cdot \left(1-P(y_i|x_i,\theta)\right) - (1-y_i) \cdot x_i^j \cdot P(y_i|x_i,\theta)} \\ &=-\frac{1}{m}\sum\limits_{i=1}^{m}{y_i \cdot x_i^j - x_i^j \cdot P(y_i|x_i,\theta)} \\ &=\frac{1}{m}\sum\limits_{i=1}^{m}{(P(y_i|x_i,\theta)-y_i) \cdot x_i^j} \end{align*} (Proved)
$\endgroup$ $\begingroup$Pedro, put the terms over a common denominator and then use $\log(x/y) = \log(x) - \log(y)$:
$$\log(1 - \frac{a}{b})$$
$$1 - \frac{a}{b} = \frac{b}{b} - \frac{a}{b} = \frac{b-a}{b},$$ $$\log(1 - \frac{a}{b}) = \log(\frac{b-a}{b}) = \log(b-a) - \log(b)$$
$\endgroup$ $\begingroup$$$J(\theta)=-\frac{1}{m} \sum_{i=1}^{m} y^i\log(h_\theta(x^i))+(1-y^i)\log(1-h_\theta(x^i))$$where $h_\theta(x)$ is defined as follows$$h_\theta(x)=g(\theta^Tx),$$$$g(z)=\frac{1}{1+e^{-z}}$$Note that $g'(z)=g(z)(1-g(z))$, so by the chain rule $\frac{\partial g}{\partial\theta_j}=g(1-g)x_j$. Writing $g'$ for $\frac{\partial g}{\partial\theta_j}$, each term of the summation can be written simply as $$y\log(g)+(1-y)\log(1-g)$$and its derivative is$$\begin{aligned} y \frac{1}{g}g'+(1-y) \left( \frac{1}{1-g}\right) (-g') &=\left( \frac{y}{g}- \frac{1-y}{1-g}\right) g' \\ &= \frac{y(1-g)-g(1-y)}{g(1-g)}g' \\ &= \frac{y-yg-g+gy}{g(1-g)}g' \\ &= \frac{y-g}{g(1-g)}g(1-g)x_j \\ &=(y-g)x_j \end{aligned}$$
Multiplying by the leading $-\frac{1}{m}$ and summing over $i$, we can rewrite the above as$$\frac{\partial}{\partial\theta_{j}}J(\theta) =\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{i})-y^i)x_j^i$$
$\endgroup$ $\begingroup$Notice that,
$$\frac{\partial}{\partial\theta_j}y_i\theta x^i = \frac{\partial}{\partial\theta_j}y_i(\theta_0 + \theta_1x^i_1 + \dots + \theta_jx^i_j + \dots + \theta_px^i_p)= $$in this partial derivative with respect to $\theta_j$, $y_i$ is a constant, so $$=y_i\frac{\partial}{\partial\theta_j}(\theta_0 + \theta_1x^i_1 + \dots + \theta_jx^i_j + \dots + \theta_px^i_p)=$$because the model is linear in $\theta$ ($\frac{\partial}{\partial \theta}k\theta = k$) and every term that does not contain $\theta_j$ has zero derivative, so$$=y_i(0 + 0 + \dots + x^i_j + \dots + 0)=$$$$=y_ix^i_j$$Finally,$$\frac{\partial}{\partial\theta_j}y_i\theta x^i = y_ix^i_j$$
$\endgroup$ $\begingroup$$ \def\o{{\tt1}}\def\p{\partial}\def\J{{\cal J}} \def\LR#1{\left(#1\right)} \def\BR#1{\Bigl(#1\Bigr)} \def\diag#1{\operatorname{diag}\LR{#1}} \def\diagb#1{\operatorname{diag}\BR{#1}} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\Diagb#1{\operatorname{Diag}\BR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\qif{\quad\iff\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} $For ease of typing, replace the Greek symbol $(\theta\to w)\,$and collect all of the $x_k$ vectors into a matrix, i.e.$$\eqalign{ X = {\tt[}x_1\;x_2\ldots\,x_m {\tt]} \\ }$$What you have called $g(z)$ is actually the logistic function which has a well-known derivative$$\frac{dg}{dz} = (1-g)\,g \qif dg = (1-g)\,g\;dz$$When applied elementwise to the vector argument $(X^Tw),\,$it produces a vector result$$\eqalign{ h &= g(X^Tw) \\ dh &= \LR{\o-h}\odot h\odot d(X^Tw) \\ &= \LR{\o-h}\odot h\odot (X^Tdw) \\ }$$where $(\odot)$ denotes the elementwise/Hadamard product.
But a Hadamard product with a vector can be replaced by the standard product by using a diagonal matrix created from the vector. Therefore$$\eqalign{ H &= \Diag h &\qif h = \diag H = H\o \\ dh &= \LR{I-H}HX^Tdw &\qif \grad hw = \LR{I-H}HX^T \\ }$$The cost function can now be expressed in a purely matrix form$$\eqalign{ Y &= \Diag y \\ \J &= -\fracLR 1m\BR{Y:\log(H)+(I-Y):\log(I-H)} \\ }$$where $(:)$ denotes the Frobenius inner product$$A:B = \trace{A^TB} = \trace{AB^T}$$Since diagonal matrices are almost as easy to work with as scalars, it becomes a rather straightforward if tedious exercise to calculate the gradient$$\eqalign{ d\J &= -\fracLR 1m\BR{Y:d\log(H)+(I-Y)\,:\,d\log(I-H)} \\ &= -\fracLR 1m\BR{Y:H^{-1}dH \;-\; (I-Y)\,:\,(I-H)^{-1}dH} \\ &= -\fracLR 1m\BR{H^{-1}Y \;-\; (I-H)^{-1}(I-Y)}\,:\,\Diag{dh} \\ &= -\fracLR 1m\diagb{H^{-1}Y \;-\; (I-H)^{-1}(I-Y)}\,:\,dh \\ &= -\fracLR 1m\BR{H^{-1}Y \;-\; (I-H)^{-1}(I-Y)}\,\o\,:\,\LR{I-H}HX^Tdw \\ &= -\fracLR 1mX\LR{I-H}H\BR{H^{-1}Y \;-\; (I-H)^{-1}(I-Y)}\,\o\,:\,dw \\ &= -\fracLR 1mX\BR{\LR{I-H}Y \;+\; H(Y-I)}\,\o\,:\,dw \\ &= -\fracLR 1mX\BR{Y-HY \;+\;HY - H}\,\o\,:\,dw \\ &= -\fracLR 1mX\BR{Y-H}\,\o\,:\,dw \\ &= +\fracLR 1mX\BR{h-y}\,:\,dw \\ \grad{\J}{w} &= \fracLR 1mX\BR{h-y} \\ }$$
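To connect the matrix result $\frac{1}{m}X(h-y)$ with the component-wise sum, here is a small pure-Python sketch that computes the gradient both ways. It follows the convention above that $X$ has one column per sample; the toy data and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_matrix_form(X, y, w):
    # X is n x m (one column per sample), y has length m, w has length n.
    n, m = len(X), len(X[0])
    # h = g(X^T w), applied elementwise
    h = [sigmoid(sum(X[k][i] * w[k] for k in range(n))) for i in range(m)]
    # gradient = (1/m) X (h - y)
    return [sum(X[k][i] * (h[i] - y[i]) for i in range(m)) / m
            for k in range(n)]

def gradient_sum_form(samples, y, w):
    # dJ/dw_j = (1/m) * sum_i (h_i - y_i) * x_j^i, one sample at a time
    m = len(samples)
    grad = [0.0] * len(w)
    for x_i, y_i in zip(samples, y):
        err = sigmoid(sum(wk * xk for wk, xk in zip(w, x_i))) - y_i
        for j, x_ij in enumerate(x_i):
            grad[j] += err * x_ij / m
    return grad

# Hypothetical toy data: each sample is (1, x_1, x_2)
samples = [[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.1], [1.0, -2.0, -0.4]]
y = [1, 0, 1, 0]
w = [0.1, -0.4, 0.7]
X = [list(col) for col in zip(*samples)]  # transpose so columns are samples

g_matrix = gradient_matrix_form(X, y, w)
g_sum = gradient_sum_form(samples, y, w)
max_diff = max(abs(a - b) for a, b in zip(g_matrix, g_sum))
print(max_diff)  # the two forms should agree up to rounding
```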
$\endgroup$