What's the "trick" in log derivative trick?
The following is often referred to as the "log derivative trick".
$$\frac{\nabla_\theta p(X,\theta)}{p(X, \theta)} = \nabla_\theta \log p(X,\theta)$$
For example here, here, and several other places (usually in reference to reinforcement learning)
Is it not just calculus? $\frac{\partial}{\partial x} \log f(x) = \frac{f'(x)}{f(x)}$ Is there anything else going on here?
$\endgroup$ 32 Answers
$\begingroup$It's a "trick", when you use it to calculate $\nabla_\theta p(X,\theta)$ via the (hopefully, sometimes) easier expression $\log p(X,\theta)$. So the use is to write it as $$ \nabla_\theta p(X,\theta)=p(X,\theta)\,\nabla_\theta\log p(X,\theta), $$ in cases where the right-hand-side is easier than the left-hand-side. Typically, when $p$ has lots of products and exponents.
$\endgroup$ 5 $\begingroup$Your are absolutely right, this is "just" calculus. But the real question here is, in what context is this trick used?
If you have an expectation value of the form
$\int d x ~p(x, \theta) f(x)$
with a parametrized probability distribution $p(x, \theta)$. It often happens that you want to calculate the derivative of this expectation value with respect to $\theta$, e.g. to maximize or minimize the expectation value.
The derivative takes the form
$\int d x ~\nabla_{\theta} p(x, \theta) f(x)$.
In practice it can be very difficult to calculate such an integral analytically, so you could estimate it via Monte Carlo sampling. But to do this, you need to bring it to the form:
$\int dx ~p(x) F(x) \approx \frac{1}{n_{MC}} \sum_{x_i} F(x_i)$,
where $x_i$ is sampled from p(x). If we now extend the integral from before by the factor $p(x, \theta)/p(x, \theta)$ and apply the log trick we get.
$\int d x ~p(x, \theta)~\nabla_{\theta} \log (p(x, \theta)) f(x)$
Now with $F(x) = \nabla_{\theta} \log (p(x, \theta)) f(x)$ we obtain exactly the Monte Carlo form from before and we can estimate the integral.
$\endgroup$