"Latent variable model" is a generic term for a broad class of statistical models. Examples include mixtures of Gaussians, factor analysis, independent component analysis (ICA), principal component analysis (PCA), and so on.
In a latent variable model, we are able to observe some of the random variables, while the rest, the latent variables, remain hidden. We fit the model by maximizing the log likelihood of the observed data.
This usually turns out to be hard to maximize, mainly because computing the likelihood requires marginalizing the joint probability over the latent variables.
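To make this concrete, write $x$ for the observed variables, $z$ for the latents, and $\theta$ for the model parameters (one common choice of notation, not fixed by the text above). The log likelihood is then

$$
\log p(x \mid \theta) = \log \int p(x, z \mid \theta) \, dz ,
$$

with the integral replaced by a sum when the latents are discrete.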
This integral can be computationally expensive or analytically intractable. A common alternative is to maximize a lower bound of the log likelihood instead of the log likelihood itself. This lower bound is called the "free energy", a term borrowed from physics.
We have defined the log likelihood, which can be expressed in terms of the marginalized joint probability.
By Jensen's inequality and the concavity of the logarithm function, any distribution $q(z)$ over the latents gives a lower bound on the log likelihood.
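Spelled out in this notation, one way to write the derivation is: multiply and divide by $q(z)$ inside the integral, then apply Jensen,

$$
\log p(x \mid \theta)
= \log \int q(z) \, \frac{p(x, z \mid \theta)}{q(z)} \, dz
\;\ge\; \int q(z) \log \frac{p(x, z \mid \theta)}{q(z)} \, dz
\;=:\; F(q, \theta),
$$

and the right-hand side is the free energy $F(q, \theta)$.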
We did this "useless" thing of multiplying and dividing
because we need to massage our equation into a ready-to-use-Jensen format. Recall that Jensen tells us: taking the mean of a random variable and then plugging it into a concave function (i.e., here the logarithm function) spits a larger value than plugging into the concave function first and then taking its mean. The inequality in
is in the same format
The free energy can be re-written as the expected joint log probability under the distribution $q$, plus the entropy of $q$.
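In the notation above, this first form is

$$
F(q, \theta) = \mathbb{E}_{q(z)}\!\left[ \log p(x, z \mid \theta) \right] + H[q],
\qquad
H[q] = -\int q(z) \log q(z) \, dz .
$$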
Or, equivalently, as the log likelihood minus the KL-divergence between $q(z)$ and the posterior over the latents.
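Again in the same notation, the second form is

$$
F(q, \theta) = \log p(x \mid \theta) - \mathrm{KL}\!\left( q(z) \,\|\, p(z \mid x, \theta) \right).
$$

Since the KL-divergence is non-negative, this form makes it immediate that $F(q, \theta) \le \log p(x \mid \theta)$.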
The derivations may look like ad hoc freestyling the first time. We re-organize the terms like this because hindsight shows that factoring out the entropy or the KL-divergence gives us useful expressions. Otherwise, one would not be expected to spontaneously want to mess around with the terms in this particular way.
The expectation-maximization (EM) algorithm maximizes the free energy by alternately updating the distribution over the latents and the model parameters.
In an E step, we update the distribution over the latents. The free energy expression written with the KL-divergence is the useful one here.
We know that the KL-divergence attains its minimum of zero exactly when the two distributions are equal.
The interpretation is that we are setting the distribution on the latents equal to the posterior over the latents given the observed data and the current parameters.
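In symbols, writing a superscript for the iteration number (as the summary further down also does), a minimal way to state the E step is

$$
q^{(t+1)}(z) = p\!\left(z \mid x, \theta^{(t)}\right),
$$

which drives the KL term to zero, so that $F\!\left(q^{(t+1)}, \theta^{(t)}\right) = \log p\!\left(x \mid \theta^{(t)}\right)$.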
Usually, we will then use Bayes' theorem to write out the posterior.
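Concretely, Bayes' theorem gives, in the notation used here,

$$
p(z \mid x, \theta)
= \frac{p(x \mid z, \theta) \, p(z \mid \theta)}{p(x \mid \theta)}
= \frac{p(x, z \mid \theta)}{\int p(x, z' \mid \theta) \, dz'} .
$$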
In the Gaussian mixture model, for example, the latent variable is the cluster assignment of each data point, and its posterior is the familiar "responsibility" of each mixture component for each data point.
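As a sketch, if we write the mixing weights, means, and covariances as $\pi_k$, $\mu_k$, $\Sigma_k$ (these symbols are just a common convention, not anything defined above), the responsibility of component $k$ for data point $x_n$ is

$$
r_{nk} = p(z_n = k \mid x_n, \theta)
= \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} .
$$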
In an M step, we update the model parameters. The first free energy expression, the one with the entropy term, is useful for this, because the entropy term does not involve the parameters $\theta$.
Usually, we take the gradient of the expected joint log probability with respect to the parameters, set it to zero, and solve for the parameter update.
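In symbols, a minimal way to state the M step is

$$
\theta^{(t+1)} = \arg\max_{\theta} \; \mathbb{E}_{q^{(t+1)}(z)}\!\left[ \log p(x, z \mid \theta) \right],
$$

which in practice is often obtained by solving $\nabla_{\theta} \, \mathbb{E}_{q^{(t+1)}(z)}\!\left[ \log p(x, z \mid \theta) \right] = 0$ when a closed form exists.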
To summarize, we have the following chain of equalities and inequalities, where the superscript denotes the iteration number.
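One way to write that chain, in the notation used so far, is

$$
\log p\!\left(x \mid \theta^{(t)}\right)
= F\!\left(q^{(t+1)}, \theta^{(t)}\right)
\le F\!\left(q^{(t+1)}, \theta^{(t+1)}\right)
\le \log p\!\left(x \mid \theta^{(t+1)}\right).
$$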
The first equality is due to the E step. By making the KL-divergence zero, the lower bound on the log likelihood, i.e., the free energy, is tightened to equal the log likelihood after an E step.
The second inequality is due to the M step. In an M step, the entropy term does not change (since $q$ is fixed), while the expected joint log probability does not decrease, because the M step updates the model parameters to maximize it.
The third inequality is due to Jensen, which we have shown above when deriving the free energy.
Hence, it holds true for every iteration that $\log p\!\left(x \mid \theta^{(t)}\right) \le \log p\!\left(x \mid \theta^{(t+1)}\right)$.
This means the log likelihood never decreases in EM algorithms. It is a pretty nice property, because many other optimization-based methods cannot guarantee that one's objective function never decreases. It is also a (frustratingly) useful debugging tool, because we will know for sure our code has bugs if the log likelihood decreases at any iteration.
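To make the whole loop and the monotonicity check concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture in Python/NumPy. The function and variable names are made up for this example, and the closed-form M-step updates are the standard GMM ones; it is a sketch of the idea, not a reference implementation (no numerical safeguards).

```python
import numpy as np

def log_gaussian(x, mu, var):
    # Element-wise log N(x | mu, var); broadcasts over array arguments.
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def em_gmm_1d(x, k, n_iters=50, seed=0):
    """EM for a 1-D mixture of k Gaussians; returns parameters and the log-likelihood trace."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    # Crude initialization: random means picked from the data, shared variance, uniform weights.
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)

    log_liks = []
    for _ in range(n_iters):
        # E step: responsibilities r[n, j] = p(z_n = j | x_n, theta).
        log_joint = np.log(pi)[None, :] + log_gaussian(x[:, None], mu[None, :], var[None, :])
        log_marginal = np.logaddexp.reduce(log_joint, axis=1)   # log p(x_n | theta)
        r = np.exp(log_joint - log_marginal[:, None])

        # Track the log likelihood; EM guarantees this never decreases.
        log_liks.append(log_marginal.sum())

        # M step: closed-form updates maximizing the expected joint log probability.
        nk = r.sum(axis=0)                    # effective counts per component
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk

    # The "frustratingly useful" debugging check: flag any decrease in log likelihood.
    diffs = np.diff(log_liks)
    assert np.all(diffs >= -1e-8), "log likelihood decreased -- there is a bug somewhere"
    return pi, mu, var, log_liks

# Tiny usage example on synthetic data drawn from two Gaussians.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
    pi, mu, var, log_liks = em_gmm_1d(data, k=2)
    print("weights:", pi, "means:", mu, "variances:", var)
    print("log likelihood went from", log_liks[0], "to", log_liks[-1])
```

The `assert` at the end is exactly the debugging check described above: if the log likelihood ever drops between iterations, something in the E or M step is wrong.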