Needless to say, I just got to my uni and I am bored already. However, I was late to class, so it is approximately 0930 hours now, and hopefully there are only three more hours I have to survive (ugh). But here is a mini-course on EBMs, as introduced by Teh, Welling, Osindero and Hinton in 2003 and subsequently elaborated on in many other papers, most prominently LeCun et al.'s A Tutorial on Energy-Based Learning (2006). I turned this into a mini-course because I think EBMs are interesting; it will be composed of a number of short posts such as this one.
Here is the high-level argument for EBMs. We start with a model $M(p, \mathbf{s}, S, \Psi )$, where $p$ are the probabilities, $\mathbf{s}$ are the sources, $S$ is the entropy and $\Psi$ is the KL-divergence. There were historically two ICA approaches: (1) the information maximization approach (IMA) and (2) the causal generative approach (CGA). EBMs as an ICA approach caught a lot of attention, and later led to the basic idea that the key is to work with canonical "statistical mechanics" to obtain a reasonable machine learning model, rather than simply brute-force optimizing or modelling. In vanilla ICA, we only have the parameters above, which are good enough in a sense. However, you can see from something like the IMA and the mutual information-entropy correspondence that there are motivations for the more canonical SM style. While there are subtleties in how the IMA reduces to something like the causal generative approach in the presence of noise (the two are equivalent in the noiseless case), these two approaches paved the way for something more powerful: energy-based models.
At the core of EBMs is the partition function, which also acts, in a way, as the generating function for the $n$-point correlators or moments. We will talk about this in a different post, but for now we only need the intuition for the partition function. If you follow Vasu Shyam (@vasud3vshyam on Twitter), you may have seen the post identifying the denominator of the $\mathrm{softmax}$ attention function as the partition function. The intuition behind why this works comes from EBMs, and is as follows. The partition function,
$$ Z=\sum_{i} \exp \left( \mathcal{X}_{i} \right) $$
appearing in the $\mathrm{softmax}$ function is related to the probabilities as
$$ p(x)=\frac{\exp (-\beta E)}{Z_{\beta }}\;. $$
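To make the softmax connection concrete, here is a minimal NumPy sketch (the logits, the value of $\beta$, and the identification of each logit with $-\beta E_i$ are my own illustrative choices): the partition function is the softmax denominator, and the Boltzmann weights reproduce the usual softmax probabilities.

```python
import numpy as np

# Illustrative logits; in attention these would be the (scaled) dot-product scores.
x = np.array([2.0, 0.5, -1.0])
beta = 1.0                       # inverse temperature, restored explicitly
E = -x / beta                    # identify each logit with a (negative, rescaled) energy

# The partition function is exactly the softmax denominator.
Z = np.sum(np.exp(-beta * E))    # same number as np.sum(np.exp(x))

# Boltzmann probabilities p_i = exp(-beta * E_i) / Z ...
p_boltzmann = np.exp(-beta * E) / Z
# ... coincide with the usual softmax of the logits.
p_softmax = np.exp(x) / np.sum(np.exp(x))

assert np.allclose(p_boltzmann, p_softmax)
print(Z, p_boltzmann)
```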
In the expression for $p(x)$, we have momentarily restored the inverse temperature $\beta$ for reasons that will become more apparent in later posts. The idea of EBMs is to treat the energy $E$ as the quantity to be minimized and obtain inference from $\min E$. This minimization problem is very similar in spirit to negative log-likelihood optimization; there is no single decisive reason to prefer this sort of deterministic approach over IMAs, and for that matter there are also cases where you would not prefer EBMs over IMAs or CGAs. Overall, this reduces to the more familiar sort of optimization algorithm with, say, a cross-entropy loss $J_{CE}$, which is (up to a factor of $\beta$) the difference of $E$ and the free energy of the ensemble $F_{\beta}$; together these become your familiar negative log-likelihood. This free energy (the Helmholtz free energy, to be more precise) is given by $-\beta^{-1}\log Z_{\beta }$, and $J_{CE}$ takes the form
$$ J_{CE}=\beta \left( E-F_{\beta } \right)=\beta E+\log Z_{\beta }\;, $$
where we can expand $F_{\beta }$ into
$$ F_{\beta }=-\beta ^{-1}\log \left( \int \exp (-\beta E) \right)\;. $$
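For a quick sanity check of this identity, here is a small sketch with made-up energies over a discrete state space (so the integral becomes a finite sum), verifying that $\beta (E-F_{\beta })=\beta E+\log Z_{\beta }$ is exactly the negative log-likelihood.

```python
import numpy as np

# Toy discrete state space with made-up energies; the integral becomes a finite sum.
E = np.array([0.3, 1.2, 2.0, 0.7])
beta = 2.0

logZ = np.log(np.sum(np.exp(-beta * E)))  # log partition function
F = -logZ / beta                          # Helmholtz free energy, F = -log(Z)/beta

# Negative log-likelihood of state i under p_i = exp(-beta * E_i) / Z
i = 2
nll = -(-beta * E[i] - logZ)              # -log p_i = beta * E_i + log Z

# The cross-entropy form J_CE = beta * (E_i - F) gives the same number.
J_CE = beta * (E[i] - F)

assert np.isclose(nll, J_CE)
print(nll, J_CE)
```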
What is interesting about the EBM approach to loss functions is that, more often than not, a given loss function may not be compatible with $\min E$: the requirement is that training shapes the energy for each model prediction in a way that pushes the energy of the correct answer down while pulling the energy of every other answer up. In LeCun et al.'s paper, they explicitly discuss the different kinds of loss functions and these compatibility arguments.
(Figure from LeCun et al., 2006.)
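As a preview of why this push-down/pull-up behaviour is baked into the negative log-likelihood (a standard identity, sketched here only for intuition), differentiating $J_{CE}$ with respect to the model parameters $\theta$ gives

$$ \partial _{\theta }J_{CE}=\beta \,\partial _{\theta }E(x;\theta )-\beta \,\mathbb{E}_{y\sim p}\left[ \partial _{\theta }E(y;\theta ) \right]\;, $$

where the first term lowers the energy of the observed answer $x$ and the second term raises the energy of other answers in proportion to how much probability the model currently assigns to them.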
In the next post, we will discuss these loss functions and gradient-based optimization with $\min E$.
As a side note, I will be linking all these posts on my blog along with their titles and timestamps, since I am not sure how to make a homepage with a blog-like feel on Notion. In case there are issues with updates or access, I will note them on a status page I will make soon. Addaam reshii a zaanta!
October 29, 2024, 2136 Polish Time.