Free Energy and EM Algorithm

Latent Variable Model

Latent variable model is a generic term for a broad class of statistical models. Examples include mixtures of Gaussians, factor analysis, independent component analysis (ICA), principal component analysis (PCA), and so on.

In a latent variable model, we are able to observe random variables X, which are generated by latent variables Z and model parameters θ. Given a latent variable model and observed data X, we usually want to find the latent variables Z and model parameters θ that maximize the log likelihood.

$$\ell(\theta) \overset{\text{def}}{=} \ln p(X|\theta)$$

This is usually hard to maximize, mainly because computing the likelihood requires marginalizing the joint probability over the latent variables.

$$p(X|\theta) = \int dZ\, p(Z, X|\theta)$$

This integral can be computationally expensive or analytically intractable. A common alternative is to maximize a lower bound of the log likelihood instead of the log likelihood itself. This lower bound is called the "free energy", a term borrowed from physics.
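To make the marginalization concrete: in a mixture of Gaussians the latent variable is discrete (the component index), so the integral becomes a sum over components. A minimal sketch, assuming a hypothetical 1-D two-component mixture with hand-picked parameters:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D mixture of two Gaussians; theta = (weights, means, stds).
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 1.5])
stds = np.array([0.8, 1.2])

# Observed data X.
x = np.array([-1.9, 0.2, 1.4])

# p(X|theta) = sum_Z p(Z|theta) p(X|Z,theta): marginalize the joint over the component index.
joint = weights[None, :] * norm.pdf(x[:, None], loc=means[None, :], scale=stds[None, :])
log_likelihood = np.log(joint.sum(axis=1)).sum()
print(log_likelihood)
```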

Free Energy

Free energy is a lower bound on the log likelihood

We have defined the log likelihood, which can be written in terms of the marginalized joint probability.

$$\ell(\theta) \overset{\text{def}}{=} \ln p(X|\theta) = \ln \int dZ\, p(Z, X|\theta)$$

By Jensen's inequality and the concavity of the logarithm, any distribution q(Z) over the latent variables gives a lower bound on ℓ(θ), which we define as the free energy F(q, θ).

$$\ell(\theta) = \ln \int dZ\, q(Z)\, \frac{p(Z, X|\theta)}{q(Z)} \;\geq\; \int dZ\, q(Z) \ln \frac{p(Z, X|\theta)}{q(Z)} \overset{\text{def}}{=} F(q, \theta) \tag{1}$$

We did this "useless" thing of multiplying and dividing by q(Z) because we need to massage the equation into a ready-to-use-Jensen form. Recall what Jensen tells us: taking the mean of a random variable X and then plugging it into a concave function (here, the logarithm) gives a value at least as large as plugging X into the concave function first and then taking the mean.

$$\ln \langle X \rangle \;\geq\; \langle \ln X \rangle$$

The inequality in (1) has exactly the same form, with the ratio p(Z, X|θ)/q(Z) playing the role of X:

$$\ln \left\langle \frac{p(Z, X|\theta)}{q(Z)} \right\rangle_{q(Z)} \;\geq\; \left\langle \ln \frac{p(Z, X|\theta)}{q(Z)} \right\rangle_{q(Z)}$$
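As a quick sanity check on the direction of the bound, here is a small numeric sketch with arbitrary, made-up values playing the role of the ratio and an arbitrary q over a discrete latent; the log of the mean is indeed at least the mean of the log:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary positive values playing the role of p(Z, X|theta) / q(Z)
# for a discrete latent Z with 10 states, and an arbitrary q(Z).
ratio = rng.uniform(0.1, 5.0, size=10)
q = rng.dirichlet(np.ones(10))

lhs = np.log(np.sum(q * ratio))   # ln < ratio >_q
rhs = np.sum(q * np.log(ratio))   # < ln ratio >_q
assert lhs >= rhs                 # Jensen: ln is concave
print(lhs, rhs)
```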

Two useful ways to rewrite the free energy

The free energy can be rewritten as the expected log joint probability under the distribution q(Z) plus the entropy of q(Z).

$$
\begin{aligned}
F(q, \theta) &= \int dZ\, q(Z) \ln \frac{p(Z, X|\theta)}{q(Z)} \\
&= \int dZ\, q(Z) \ln p(Z, X|\theta) - \int dZ\, q(Z) \ln q(Z) \\
&= \big\langle \ln p(Z, X|\theta) \big\rangle_{q(Z)} + H(q)
\end{aligned} \tag{2}
$$

Or as the log likelihood minus the KL divergence between q(Z) and the true posterior over the latents.

$$
\begin{aligned}
F(q, \theta) &= \int dZ\, q(Z) \ln \frac{p(X|\theta)\, p(Z|X, \theta)}{q(Z)} \\
&= \int dZ\, q(Z) \ln p(X|\theta) + \int dZ\, q(Z) \ln \frac{p(Z|X, \theta)}{q(Z)} \\
&= \ln p(X|\theta) \int dZ\, q(Z) - \int dZ\, q(Z) \ln \frac{q(Z)}{p(Z|X, \theta)} \\
&= \ln p(X|\theta) - \mathrm{KL}\big[\, q(Z) \,\|\, p(Z|X, \theta) \,\big]
\end{aligned} \tag{3}
$$

The derivations may look like ad hoc freestyling the first time you see them. We reorganize the terms like this because hindsight shows that factoring out the entropy or the KL divergence gives useful expressions. Otherwise, one is not expected to spontaneously want to mess around with the terms in this particular way.
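For a discrete latent variable, both identities are easy to check numerically. Below is a minimal sketch, assuming a hypothetical joint table p(Z, X=x|θ) over a four-state latent and an arbitrary q(Z); expressions (2) and (3) evaluate to the same number:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint p(Z, X = x_obs | theta) over a discrete latent Z with 4 states.
joint = rng.uniform(0.01, 1.0, size=4)
q = rng.dirichlet(np.ones(4))            # arbitrary q(Z)

p_x = joint.sum()                        # p(X | theta)
posterior = joint / p_x                  # p(Z | X, theta)

# (2): expected log joint under q plus the entropy of q.
F_via_2 = np.sum(q * np.log(joint)) - np.sum(q * np.log(q))
# (3): log likelihood minus KL[q || posterior].
F_via_3 = np.log(p_x) - np.sum(q * np.log(q / posterior))

print(F_via_2, F_via_3)                  # the two forms agree (up to rounding)
assert np.isclose(F_via_2, F_via_3)
```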

EM

The expectation-maximization (EM) algorithm maximizes the free energy by updating the model parameters θ and (the distribution over) the latents Z in alternation.

E step

In an E step, we update the distribution over the latents. The free energy expression in (3) is useful for this, because the first term, the log likelihood, does not depend on q(Z). Hence, maximizing the free energy with respect to q(Z) is equivalent to minimizing the KL divergence.

$$q(Z) = \arg\max_q F(q, \theta) = \arg\min_q \mathrm{KL}\big[\, q(Z) \,\|\, p(Z|X, \theta) \,\big]$$

We know that the KL divergence attains its minimum of zero if and only if

$$q(Z) = p(Z|X, \theta)$$

The interpretation is that we are setting the distribution on Z to the true posterior under the data set and the current parameters.

Usually, we will then use Bayes' theorem to write out the posterior.

$$p(Z|X, \theta) = \frac{p(X|Z, \theta)\, p(Z|\theta)}{\int dZ'\, p(X|Z', \theta)\, p(Z'|\theta)}$$

In the Gaussian mixture model, q(Z) goes by the name of responsibility, measuring how much a given mixture component is responsible for having generated a data point.
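A minimal sketch of the E step for a 1-D Gaussian mixture, computing the responsibilities via Bayes' theorem (the parameter values below are hypothetical placeholders):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, weights, means, stds):
    """E step for a 1-D Gaussian mixture: q(Z) = p(Z | X, theta)."""
    # Unnormalized posterior: p(X|Z, theta) * p(Z|theta) for each component.
    joint = weights[None, :] * norm.pdf(x[:, None], means[None, :], stds[None, :])
    # Normalize over components to get the responsibilities (Bayes' theorem).
    return joint / joint.sum(axis=1, keepdims=True)

x = np.array([-1.9, 0.2, 1.4, 2.0])
resp = e_step(x, np.array([0.5, 0.5]), np.array([-2.0, 1.5]), np.array([1.0, 1.0]))
print(resp)   # one row per data point, one column per component; rows sum to 1
```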

M step

In an M step, we update the model parameters. The free energy expression in (2) is useful for this, because the entropy term does not involve θ. Hence, maximizing the free energy with respect to θ is equivalent to maximizing the expected log joint under q(Z).

$$\theta = \arg\max_\theta F(q, \theta) = \arg\max_\theta \big\langle \ln p(Z, X|\theta) \big\rangle_{q(Z)}$$

Usually, we take the gradient of the expected log joint with respect to the parameters and set it to zero to solve for the parameter update.

$$\nabla_\theta \big\langle \ln p(Z, X|\theta) \big\rangle_{q(Z)} \overset{\text{set}}{=} 0$$
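For a Gaussian mixture, setting this gradient to zero yields closed-form updates: each component's weight, mean, and variance become responsibility-weighted statistics of the data. A sketch under the same hypothetical 1-D setup as above:

```python
import numpy as np

def m_step(x, resp):
    """M step for a 1-D Gaussian mixture, given responsibilities resp[n, k]."""
    n_k = resp.sum(axis=0)                                   # effective counts per component
    weights = n_k / len(x)                                   # mixing weights
    means = (resp * x[:, None]).sum(axis=0) / n_k            # responsibility-weighted means
    variances = (resp * (x[:, None] - means[None, :]) ** 2).sum(axis=0) / n_k
    return weights, means, np.sqrt(variances)

x = np.array([-1.9, 0.2, 1.4, 2.0])
resp = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.1, 0.9]])  # from an E step
print(m_step(x, resp))
```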

Convergence of EM

Summarizing, we have the following inequality chain, where the superscript denotes the iteration number.

$$\ell(\theta^{(t-1)}) = F(q^{(t)}, \theta^{(t-1)}) \;\leq\; F(q^{(t)}, \theta^{(t)}) \;\leq\; \ell(\theta^{(t)})$$

The first equality is due to the E step. By making the KL divergence zero, the lower bound on the log likelihood (i.e., the free energy) is tightened to equal the log likelihood after an E step.

The second relation, an inequality, is due to the M step. In an M step, the entropy of q does not change, but the expected log joint increases (or stays the same) because the M step updates the model parameters to maximize it.

The third inequality is due to Jensen, which we have shown in equation (1).

Hence, it holds true for every iteration that

$$\ell(\theta^{(t-1)}) \;\leq\; \ell(\theta^{(t)})$$

This means the log likelihood never decreases during EM. This is a pretty nice property, because many other optimization-based methods cannot guarantee that one's objective function never decreases. It is also a (frustratingly) useful debugging tool, because we know for sure our code has a bug if the log likelihood decreases at any iteration.
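Putting the two steps together, here is a sketch of the full loop for a hypothetical 1-D two-component mixture fit to synthetic data, with an assertion implementing the debugging check described above (the log likelihood must never decrease):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
# Synthetic data drawn from a hypothetical two-component mixture.
x = np.concatenate([rng.normal(-2.0, 0.8, 200), rng.normal(1.5, 1.2, 300)])

# Initial parameters theta^(0).
weights, means, stds = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

prev_ll = -np.inf
for t in range(100):
    joint = weights[None, :] * norm.pdf(x[:, None], means[None, :], stds[None, :])
    ll = np.log(joint.sum(axis=1)).sum()            # ln p(X|theta) under current parameters
    assert ll >= prev_ll - 1e-9                     # the log likelihood must never decrease
    prev_ll = ll

    resp = joint / joint.sum(axis=1, keepdims=True) # E step: responsibilities
    n_k = resp.sum(axis=0)                          # M step: responsibility-weighted statistics
    weights = n_k / len(x)
    means = (resp * x[:, None]).sum(axis=0) / n_k
    stds = np.sqrt((resp * (x[:, None] - means[None, :]) ** 2).sum(axis=0) / n_k)

print(prev_ll, weights, means, stds)
```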

© Yedi Zhang | Last updated: April 2023