$$ \DeclareMathOperator*{\E}{\mathbb{E}} \newcommand{\cE}{\mathcal{E}} $$

Machine Learning and Statistical Mechanics

What do ML and SM have in common?

Both fields use probabilistic models with large numbers of variables
There are theoretical concepts and tools that apply to both

Probability in Statistical Mechanics

Goal: describe thermodynamic properties of macroscopic system probabilistically in terms of microscopic constituents.
The probabilistic model is normally the Boltzmann distribution

$$ p(\mathbf{x})=\frac{\exp\left[-\beta \mathcal{E}(\mathbf{x})\right]}{Z}, $$

Normalizing constant $Z$ is partition function, $\mathcal{E}(\mathbf{x})$ is the energy of configuration $\mathbf{x}$, and $\beta=1/T$ is inverse temperature
Central problem of SM: computing averages of physical quantities
Principle difficulty: this is hard

Example I: Classical gas

$\mathbf{x}$ corresponds to positions of each gas molecule: $\mathbf{x}=(\mathbf{x}_1,\ldots \mathbf{x}_N)$
Average is a $3N$-dimensional integral
Only tractable case: noninteracting (ideal) gas, in which case

$$ \mathcal{E}(\mathbf{x}) = \sum_{n=1}^N \mathcal{E}_1(\mathbf{x}_n) $$

If we introduce interactions between particles of the form $$ \mathcal{E}(\mathbf{x}) = \sum_{n<m}^N \mathcal{E}_2(\mathbf{x}_n,\mathbf{x}_m) $$ things get a lot harder.

Example 2: Ising model

Can also have discrete random variables, e.g. Ising model
Configuration corresponds to fixing the values of $N$ “spins” $\sigma_n=\pm 1$ with an energy function of the form $$ \mathcal{E}(\sigma)=\sum_n h_n\sigma_n + \sum_{n,m} J_{mn}\sigma_m\sigma_n. $$ It’s the couplings $J_{mn}$ that causes problems / interest

Worst case: sum over $2^N$ configurations
Solve approximately with: mean field theory, Monte Carlo, etc.

Probability in Machine Learning

Example: computer vision.
Image defined by set $(R,G,B)$ values of each pixel each $\in [0,255]$
Basic hypothesis of probabilistic ML

Dataset represents a set of independent and identically distributed (iid) samples of some random variables.

For image, variables are RGB values of pixels
Distribution has to be highly correlated and have a great deal of complex structure: cats and dogs not white noise

SM vs. ML

Classical SM: motion of molecules deterministic but complicated. Replace with probability model constrained by physics
ML: (mostly) rely solely on data. Infer properties of model. How?
Recent prgress using models based on NNs + training algorithms
Allows rich probability models describing images or audio signals

SM vs. ML

Glow, Diederik P. Kingma, Prafulla Dhariwal, arXiv:1807.03039

Applications of probablistic ML

Sampling (generative modelling). Obvious physics applications
Density estimation. Anomaly detection
Compression

This lecture

Some mathematical background
Variational inference

Some mathematical background

Probabilities

Probabilities $p(x)\geq 0$ satisfy

$$ \sum_x p(x)=1 $$

For continuous variables $$ \int p(x) dx=1, $$ but we’ll use the discrete notation throughout.

Joint probabilities denoted $p(x_1,\ldots x_N)$
Sum over subset to give marginal distribution of remaining

$$ p(x)= \sum_{y} p(x,y). $$

Conditional probability $p(x|y)$: distribution of $x$ given fixed $y$. The relation between joint and conditional probabilities is

$$ p(x,y)=p(x|y)p(y) \tag{1} \label{eq:joint} $$

Chain rule of probability

$$ p(x_1,\ldots x_N)=p(x_1)p(x_2|x_1)p(x_3|x_2,x_1)\cdots p(x_N|x_1,\ldots x_{N-1}), \tag{2} \label{eq:chain} $$

c.f. Autoregressive models
Sampling is easy!

Priors and posteriors

Another way to express the joint probability is

$$ p(x,y)=p(y|x)p(x) $$

We deduce Bayes’ theorem

$$ p(y|x)=\frac{p(x|y)p(y)}{p(x)} $$

Bayesian statistics

Bayes’ theorem is workhorse of Bayesian statistics
Regard parameters $z$ in your probability model as random variables taken from some initial distribution $p(z)$, called the prior distribution (or just the prior)

Example

Model distribution is Gaussian normal distribution with mean $\mu$ and variance $\sigma^2$
Parameters are $z=(\mu,\sigma^2)$
For prior could choose a normal distribution: $\mu\sim \mathcal{N}(\mu_\mu,\sigma^2_\mu))$
For $\sigma^2$ a distribution of a positive quantity: the inverse gamma distribution is a popular choice.

Once parameters fixed, have a model distribution for your data that can be thought of as the conditional distribution $p(x|z)$
What does an observation of $x$ tell me? Just use Bayes:

$$ p(z|x) = \frac{p(x|z)p(z)}{p(x)}. $$

This is the posterior distribution (or just posterior)
Note that the denominator doesn’t depend on $z$, it just provides a normalization. If you have lots of data points then

$$ p(z|x_1,\ldots x_N) \propto p(x_1,\ldots x_N|z)p(z). $$

Bayes’ theorem lets us update our beliefs about parameters based on our initial beliefs and any evidence we receive. This is inference.

Bayesian https://t.co/HqojdMeaan (click for full comic)#smbc #hiveworks pic.twitter.com/Wijy3kz5cs
— Zach Weinersmith (@ZachWeiner) November 8, 2020

Latent variables; generative models

We allow the $z$s to have different distributions for different data points $p(z_n|x_n)$
Equivalently, our model is defined by a joint distribution $p(x,z)$.

Example: mixture model

$M$ different components, each with their own distribution $p(x|m)$ and occurring with probability $p(m)$, so that

$$ p(x) = \sum_m p(m)p(x|m). $$

Observation $x$ will give me information about $p(m|x)$, telling which of the $M$ components that observation belongs to.
This may bring insight, if latent variables are interpretable
Or: a more powerful model

Latent variables allow for structure learning
Example: for a dataset of images of people walking we’d like to find latent variables parameterize a manifold of different poses.
Latent variable models are also the basis of generative modelling: sampling from a distribution $p(x)$ learnt from data.
If the model has been formulated in terms of a prior $p(z)$ over latent variables and a generative model $p(x|z)$, sampling is straightforward in principle.

Entropy

In SM we’re familiar with entropy associated with probability distribution.
Arrived in ML from information theory

$$ H[p]=- \sum_x p(x)\log_2 p(x). $$

Taking the logarithm base 2 means we measure in bits

One interpretation of entropy

$N$ iid variables with distribution $p(x)$
Probability of observing a sequence $x_1,\ldots x_N$ is

$$ \begin{equation} p(x_1,\ldots x_N)=\prod_{n=1}^N p(x_n). \end{equation} \tag{3} \label{eq:seq} $$

Probability is obviously exponentially small as $N\to\infty$, but how small?

$$ \lim_{N\to\infty} \frac{1}{N}\log p(x_1,\ldots x_N) = -H[p]. $$

Asymptotic partition property
Shouldn’t the probability depend on what you actually get?
Suppose you have a biased coin that give heads with probability $p_H>0.5$ and tails with probability $p_T=1-p_H$
Chance of getting half heads and half tails exponentially small

What you’re going to get instead is

$$ \frac{N_H}{N}\to p_H\qquad \frac{N_T}{N}\to p_T\qquad . $$

What is the probability of such a sequence? $p_H^{N_H}p_T^{N_T}$

$$ \log_2\left(p_H^{N_H}p_T^{N_T}\right)= N_H\log_2 p_H + N_T\log_2 p_T = -N H[p_H, p_T]. $$

Entropy and information

A way to quantify information in a signal
If the coin is really biased, you will be surprised when you get tails
Entropy lower than for fair coin, which has maximum entropy $H=1$

HHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHTHHHHT

To describe such a sequence, you might say “21 H, 13 H, 4 H”
Shorter than the original sequence; possible because of the high degree of predictability
But: extra symbols including the digits 0-9 and comma.
Should instead compare with a binary code of only two symbols
How can we exploit the lower entropy of the sequence?

Use fact that we expect $N_H=Np_H$ heads and $N_T=N p_T$ tails, so we can just give the ordering of these heads and tails, which is one of $$ \frac{N!}{N_H! N_T!} $$ possibilities. If we label each of these with a binary number, we end up with a description of length $$ \log_2\left(\frac{N!}{N_H! N_T!}\right)\sim N H[p]\leq N $$ (where we used Stirling’s approximation $\log n! \sim n\log n -n$).

Simplest illustration of Shannon’s source coding theorem:

N i.i.d. random variables each with entropy H(X) can be compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞; but conversely, if they are compressed into fewer than N H(X) bits it is virtually certain that information will be lost.

Shannon’s theorem is the core idea that underlies (lossless) data compression
The more predictable a signal (i.e. the lower the entropy) the more it can be compressed, with the entropy setting a fundamental limit on the number of bits required.

Divergences

We need some way of talking about the degree to which two distributions differ
Most common measure in use in ML is the Kullback–Leibler divergence (KL)

$$ D_\text{KL}(p||q)=\sum_x p(x)\log\left(\frac{p(x)}{q(x)}\right)=\E_{x\sim p}\log\left(\frac{p(x)}{q(x)}\right). $$

KL has property

$$ D_\text{KL}(p||q)\geq 0 $$

Consequence of Jensen’s inequality. For a convex function $\varphi(x)$

$$ \E\left[\varphi(x))\right]\geq \varphi\left(\E\left[x\right]\right) $$

Apply this to the KL then, using the convexity of $\varphi(x)=-\log(x)$ $$ D_\text{KL}(p||q)=-\E_{x\sim p}\log\left(\frac{q(x)}{p(x)}\right)\geq -\log\left(\E_{x\sim p}\left[\frac{q(x)}{p(x)}\right]\right)=-\log(1)=0, $$ with equality if and only if $p=q$.

Variational inference (VI)

Recall Bayes'

$$ p(z|x) = \frac{p(x|z)p(z)}{p(x)}=\frac{p(x,z)}{p(x)}. $$

Complicated latent variable model $\longrightarrow$ intractable denominator

Mean field theory

For SM model like Ising the probability has the form

$$ p(\sigma) = \frac{\exp\left[-\beta\cE(\sigma)\right]}{Z}. $$

$$ q_\theta(\sigma)=\prod_n q_{\theta_n}(\sigma_n). $$

Natural to try to minimize

$$ D_\text{KL}(q||p)(q||p)=\E_{\sigma\sim q_\theta}\left[\log\left(\frac{q_\theta(\sigma)}{p(\sigma)}\right)\right]. $$

Substituting in the Boltzmann distribution $$ D_\text{KL}(q||p)(q||p)= \log Z - H[q_\theta] + \beta \E_{\sigma\sim q_\theta}\left[\cE(\sigma)\right]\geq 0, $$ or in usual SM language $$ \E_{\sigma\sim q_\theta}\left[\cE(\sigma)\right]-TH[q_\theta] \geq F, $$ where $F=-T\log Z$ is the Helmholtz free energy.
This is the Bogoliubov or Gibbs inequality

For Ising spins our factorized distributions are defined by fields on each site $$ q_{\theta_n}(\sigma_n) = \frac{\exp\left[-\beta\theta_n\sigma_n\right]}{2\cosh (\beta\theta_n)}, $$ with average spin $$ \E_{\sigma_n\sim q_n}\left[\sigma_n\right] = -\tanh\left(\beta\theta_n\right). $$

VI in latent variable models

Just need to replace the Boltzmann distribution with $$ p(z|x) =\frac{p(x,z)}{p_\text{M}(x)}. $$ (we add the subscript “M” for model)
Role of spins $\sigma$ is now played by the latent variables
Following same steps leads us to

$$ \log p_\text{M}(x) \geq \E_{z\sim q_\theta(\cdot|x)}\left[\log p(x,z)\right]+ H[q_\theta(\cdot|z)]. $$

RHS is Evidence lower bound or ELBO (marginalized probability $p(x)$ on the left is sometimes called the model evidence).
Possible to rewrite as $$ \log p_\text{M}(x) \geq \log p_\text{M}(x) - D_\text{KL}(q_\theta(\cdot|x)||p(\cdot|x)), $$ so the bound is saturated when the variational posterior for the latent variables coincides with the true posterior $$ p(z|x)=p(x,z)/p_\text{M}(x) $$