$$ \DeclareMathOperator*{\E}{\mathbb{E}} \newcommand{\cE}{\mathcal{E}} $$
Both fields use probabilistic models with large numbers of variables
There are theoretical concepts and tools that apply to both
Goal: describe thermodynamic properties of macroscopic system probabilistically in terms of microscopic constituents.
The probabilistic model is normally the Boltzmann distribution
$$ p(\mathbf{x})=\frac{\exp\left[-\beta \mathcal{E}(\mathbf{x})\right]}{Z}, $$
Normalizing constant $Z$ is the partition function, $\mathcal{E}(\mathbf{x})$ is the energy of configuration $\mathbf{x}$, and $\beta=1/T$ is the inverse temperature
Central problem of SM: computing averages of physical quantities
Principal difficulty: evaluating these averages is hard
Example: for a gas, $\mathbf{x}$ corresponds to the positions of each molecule: $\mathbf{x}=(\mathbf{x}_1,\ldots \mathbf{x}_N)$
Average is a $3N$-dimensional integral
Only tractable case: noninteracting (ideal) gas, in which case
$$ \mathcal{E}(\mathbf{x}) = \sum_{n=1}^N \mathcal{E}_1(\mathbf{x}_n) $$
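Why this case is tractable (a standard one-line calculation): the Boltzmann weight factorizes, so the partition function reduces to a product of single-particle integrals
$$ Z = \int d\mathbf{x}_1\cdots d\mathbf{x}_N\, \prod_{n=1}^N e^{-\beta \mathcal{E}_1(\mathbf{x}_n)} = \left[\int d\mathbf{x}\, e^{-\beta \mathcal{E}_1(\mathbf{x})}\right]^N, $$
and averages of single-particle quantities become $3$-dimensional integrals rather than $3N$-dimensional ones.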
Can also have discrete random variables, e.g. Ising model
Configuration corresponds to fixing the values of $N$ “spins” $\sigma_n=\pm 1$ with an energy function of the form $$ \mathcal{E}(\sigma)=\sum_n h_n\sigma_n + \sum_{n,m} J_{mn}\sigma_m\sigma_n. $$ It’s the couplings $J_{mn}$ that cause the difficulty (and the interest)
Worst case: sum over $2^N$ configurations
Solve approximately with: mean field theory, Monte Carlo, etc.
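To make the $2^N$ cost concrete, here's a minimal sketch (made-up couplings, small $N$ only) that computes the average magnetization of an Ising chain by brute-force enumeration of every configuration, exactly the thing that becomes impossible for large $N$:

```python
import itertools
import numpy as np

def energy(sigma, h, J):
    """Ising energy E = sum_n h_n sigma_n + sum_{m<n} J_mn sigma_m sigma_n."""
    sigma = np.asarray(sigma)
    return h @ sigma + sigma @ J @ sigma

def average_magnetization(h, J, beta=1.0):
    """Exact <m> by summing over all 2^N configurations (feasible only for small N)."""
    weights, mags = [], []
    for sigma in itertools.product([-1, 1], repeat=len(h)):
        weights.append(np.exp(-beta * energy(sigma, h, J)))
        mags.append(np.mean(sigma))
    Z = np.sum(weights)                 # partition function
    return np.dot(weights, mags) / Z

# Made-up example: ferromagnetic nearest-neighbour chain of N = 10 spins in a weak field
N = 10
h = -0.1 * np.ones(N)
J = -1.0 * np.eye(N, k=1)               # couplings J_{n,n+1} = -1, zero otherwise
print(average_magnetization(h, J, beta=0.5))
```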
Example: computer vision.
Image defined by the $(R,G,B)$ values of each pixel, each $\in \{0,\ldots,255\}$
Basic hypothesis of probabilistic ML
Dataset represents a set of independent and identically distributed (iid) samples of some random variables.
For image, variables are RGB values of pixels
The variables of this distribution are highly correlated and it has a great deal of complex structure: images of cats and dogs, not white noise
Classical SM: motion of molecules deterministic but complicated. Replace with probability model constrained by physics
ML: (mostly) rely solely on data. Infer properties of model. How?
Recent progress using models based on NNs + training algorithms
Allows rich probability models describing images or audio signals
Recap of probability basics. Distributions are normalized:
$$ \sum_x p(x)=1 $$
Joint probabilities denoted $p(x_1,\ldots x_N)$
Sum over a subset of variables to give the marginal distribution of the remaining ones
$$ p(x)= \sum_{y} p(x,y). $$
$$ p(x,y)=p(x|y)p(y) \tag{1} \label{eq:joint} $$
$$ p(x_1,\ldots x_N)=p(x_1)p(x_2|x_1)p(x_3|x_2,x_1)\cdots p(x_N|x_1,\ldots x_{N-1}), \tag{2} \label{eq:chain} $$
Sampling is easy!
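The factorization above suggests "ancestral" sampling: draw $x_1$ from $p(x_1)$, then $x_2$ from $p(x_2|x_1)$, and so on. A minimal sketch for two binary variables with made-up conditional probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up model: p(x1) and p(x2 | x1) for binary variables
p_x1 = np.array([0.7, 0.3])                  # p(x1=0), p(x1=1)
p_x2_given_x1 = np.array([[0.9, 0.1],        # p(x2 | x1=0)
                          [0.2, 0.8]])       # p(x2 | x1=1)

def sample_joint():
    """Ancestral sampling: x1 ~ p(x1), then x2 ~ p(x2|x1)."""
    x1 = rng.choice(2, p=p_x1)
    x2 = rng.choice(2, p=p_x2_given_x1[x1])
    return x1, x2

print([sample_joint() for _ in range(5)])
```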
$$ p(x,y)=p(y|x)p(x) $$
$$ p(y|x)=\frac{p(x|y)p(y)}{p(x)} $$
Bayes’ theorem is workhorse of Bayesian statistics
Regard parameters $z$ in your probability model as random variables taken from some initial distribution $p(z)$, called the prior distribution (or just the prior)
Example: the model distribution is a normal (Gaussian) distribution with mean $\mu$ and variance $\sigma^2$
Parameters are $z=(\mu,\sigma^2)$
For the prior on $\mu$ could choose a normal distribution: $\mu\sim \mathcal{N}(\mu_\mu,\sigma^2_\mu)$
For $\sigma^2$ we need a distribution over positive quantities: the inverse gamma distribution is a popular choice.
Once parameters fixed, have a model distribution for your data that can be thought of as the conditional distribution $p(x|z)$
What does an observation of $x$ tell me? Just use Bayes:
$$ p(z|x) = \frac{p(x|z)p(z)}{p(x)}. $$
This is the posterior distribution (or just posterior)
Note that the denominator doesn’t depend on $z$; it just provides the normalization. If you have lots of (iid) data points then
$$ p(z|x_1,\ldots x_N) \propto p(x_1,\ldots x_N|z)p(z). $$
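As a concrete (if simplistic) illustration, here's a sketch that computes the posterior over the mean $\mu$ of a Gaussian with known variance on a grid of $\mu$ values; the prior, the data, and the grid are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: N observations from a Gaussian with unknown mean, known sigma
sigma = 1.0
x = rng.normal(loc=2.0, scale=sigma, size=20)

# Grid over the parameter mu, with a normal prior N(0, 2^2)
mu_grid = np.linspace(-5, 5, 1001)
log_prior = -0.5 * (mu_grid / 2.0) ** 2

# Log-likelihood sum_n log p(x_n | mu) for each grid value of mu
log_like = (-0.5 * ((x[:, None] - mu_grid[None, :]) / sigma) ** 2).sum(axis=0)

# Posterior proportional to likelihood x prior (normalize on the grid)
log_post = log_prior + log_like
post = np.exp(log_post - log_post.max())
post /= post.sum() * (mu_grid[1] - mu_grid[0])

print("posterior mean of mu:", np.sum(mu_grid * post) * (mu_grid[1] - mu_grid[0]))
```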
Latent variables: now allow the $z$s to take different values for different data points, so each $x_n$ has its own posterior $p(z_n|x_n)$
Equivalently, our model is defined by a joint distribution $p(x,z)$.
Example: a mixture model with $M$ components, where the latent variable $m$ labels the component
$$ p(x) = \sum_m p(m)p(x|m). $$
An observation $x$ gives information about $m$ through the posterior $p(m|x)$, telling me which of the $M$ components the observation is likely to belong to.
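A sketch of this computation for a hypothetical two-component 1D Gaussian mixture, using Bayes’ theorem $p(m|x)\propto p(m)p(x|m)$:

```python
import numpy as np
from scipy.stats import norm

# Made-up mixture: two Gaussian components with weights p(m)
weights = np.array([0.3, 0.7])                           # p(m)
means, stds = np.array([-2.0, 3.0]), np.array([1.0, 1.5])

def component_posterior(x):
    """p(m|x) proportional to p(m) p(x|m), normalized over components."""
    joint = weights * norm.pdf(x, loc=means, scale=stds)  # p(m) p(x|m)
    return joint / joint.sum()

print(component_posterior(2.5))   # observation near the second component
```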
This may bring insight, if latent variables are interpretable
Or: a more powerful model
Latent variables allow for structure learning
Example: for a dataset of images of people walking we’d like to find latent variables that parameterize a manifold of different poses.
Latent variable models are also the basis of generative modelling: sampling from a distribution $p(x)$ learnt from data.
If the model has been formulated in terms of a prior $p(z)$ over latent variables and a generative model $p(x|z)$, sampling is straightforward in principle.
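In code, “straightforward in principle” again means ancestral sampling: draw $z\sim p(z)$, then $x\sim p(x|z)$. A toy sketch with a made-up linear-Gaussian $p(x|z)$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up generative model: z ~ N(0, I), x | z ~ N(W z + b, sigma^2 I)
latent_dim, data_dim, sigma = 2, 5, 0.1
W = rng.normal(size=(data_dim, latent_dim))
b = rng.normal(size=data_dim)

def sample_x(n_samples):
    """Ancestral sampling from the latent variable model."""
    z = rng.normal(size=(n_samples, latent_dim))                      # z ~ p(z)
    x = z @ W.T + b + sigma * rng.normal(size=(n_samples, data_dim))  # x ~ p(x|z)
    return x

print(sample_x(3))
```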
In SM we’re familiar with the entropy associated with a probability distribution.
Arrived in ML from information theory
$$ H[p]=- \sum_x p(x)\log_2 p(x). $$
$N$ iid variables with distribution $p(x)$
Probability of observing a sequence $x_1,\ldots x_N$ is
$$ \begin{equation} p(x_1,\ldots x_N)=\prod_{n=1}^N p(x_n). \end{equation} \tag{3} \label{eq:seq} $$
$$ \lim_{N\to\infty} \frac{1}{N}\log_2 p(x_1,\ldots x_N) = -H[p]. $$
Shouldn’t the probability depend on what you actually get?
Suppose you have a biased coin that gives heads with probability $p_H>0.5$ and tails with probability $p_T=1-p_H$
The chance of getting half heads and half tails is exponentially small
$$ \frac{N_H}{N}\to p_H\qquad \frac{N_T}{N}\to p_T. $$
$$ \log_2\left(p_H^{N_H}p_T^{N_T}\right)= N_H\log_2 p_H + N_T\log_2 p_T = -N H[p_H, p_T]. $$
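A quick numerical sanity check of this statement (with a made-up bias $p_H$): the per-flip $\log_2$-probability of a long sampled sequence should approach $-H$:

```python
import numpy as np

rng = np.random.default_rng(3)

p_H = 0.9
N = 100_000
flips = rng.random(N) < p_H                  # True = heads

# Per-symbol log2-probability of the observed sequence
log2_prob = flips.sum() * np.log2(p_H) + (N - flips.sum()) * np.log2(1 - p_H)
print("-(1/N) log2 p(sequence):", -log2_prob / N)

# Entropy of the biased coin in bits
H = -(p_H * np.log2(p_H) + (1 - p_H) * np.log2(1 - p_H))
print("H =", H)
```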
A way to quantify information in a signal
If the coin is really biased, you will be surprised when you get tails
Entropy lower than for fair coin, which has maximum entropy $H=1$
HHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHTHHHHT
To describe such a sequence, you might say “21 H, 13 H, 4 H”
Shorter than the original sequence; possible because of the high degree of predictability
But: extra symbols including the digits 0-9 and comma.
Should instead compare with a binary code of only two symbols
How can we exploit the lower entropy of the sequence?
N i.i.d. random variables each with entropy H(X) can be compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞; but conversely, if they are compressed into fewer than N H(X) bits it is virtually certain that information will be lost.
Shannon’s theorem is the core idea that underlies (lossless) data compression
The more predictable a signal (i.e. the lower the entropy) the more it can be compressed, with the entropy setting a fundamental limit on the number of bits required.
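As a rough illustration (not a proof), we can compare a general-purpose lossless compressor with the $NH$-bit limit for the biased coin; `zlib` won't reach the limit, but it can't beat it on average either:

```python
import zlib
import numpy as np

rng = np.random.default_rng(4)

p_H, N = 0.9, 100_000
flips = rng.random(N) < p_H

# Pack the sequence into bits, then compress losslessly with zlib
raw_bytes = np.packbits(flips).tobytes()
compressed_bits = 8 * len(zlib.compress(raw_bytes, level=9))

H = -(p_H * np.log2(p_H) + (1 - p_H) * np.log2(1 - p_H))
print("entropy limit  :", N * H, "bits")
print("zlib compressed:", compressed_bits, "bits")
```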
We need some way of talking about the degree to which two distributions differ
Most common measure in use in ML is the Kullback–Leibler divergence (KL)
$$ D_\text{KL}(p||q)=\sum_x p(x)\log\left(\frac{p(x)}{q(x)}\right)=\E_{x\sim p}\log\left(\frac{p(x)}{q(x)}\right). $$
$$ D_\text{KL}(p||q)\geq 0, $$
which follows from Jensen’s inequality for a convex function $\varphi$
$$ \E\left[\varphi(x)\right]\geq \varphi\left(\E\left[x\right]\right), $$
applied to $\varphi=-\log$.
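A minimal numerical check of the definition and of nonnegativity, using random discrete distributions (Dirichlet sampling is just a convenient way to generate them):

```python
import numpy as np

rng = np.random.default_rng(5)

def kl(p, q):
    """D_KL(p||q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions."""
    return np.sum(p * np.log(p / q))

# A few random pairs of distributions on 10 outcomes: KL is always >= 0,
# and it vanishes when p == q
for _ in range(3):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    print(kl(p, q), kl(p, p))
```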
The thing we’d like to approximate is the posterior
$$ p(z|x) = \frac{p(x|z)p(z)}{p(x)}=\frac{p(x,z)}{p(x)}. $$
Start with the SM version of the problem: approximating the Boltzmann distribution of the Ising model $$ p(\sigma) = \frac{\exp\left[-\beta\cE(\sigma)\right]}{Z} $$
using a factorized (mean field) variational ansatz with parameters $\theta=(\theta_1,\ldots \theta_N)$ $$ q_\theta(\sigma)=\prod_n q_{\theta_n}(\sigma_n). $$
Measure the quality of the approximation with the KL divergence $$ D_\text{KL}(q||p)=\E_{\sigma\sim q_\theta}\left[\log\left(\frac{q_\theta(\sigma)}{p(\sigma)}\right)\right]. $$
Substituting in the Boltzmann distribution $$ D_\text{KL}(q||p)= \log Z - H[q_\theta] + \beta \E_{\sigma\sim q_\theta}\left[\cE(\sigma)\right]\geq 0, $$ or in usual SM language $$ \E_{\sigma\sim q_\theta}\left[\cE(\sigma)\right]-TH[q_\theta] \geq F, $$ where $F=-T\log Z$ is the Helmholtz free energy.
This is the Bogoliubov or Gibbs inequality
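A sketch of the Gibbs inequality in action for a small, made-up Ising chain: compute the exact $F$ by enumeration and compare it with the mean-field bound $\E_{\sigma\sim q_\theta}[\cE(\sigma)] - TH[q_\theta]$ minimized over the site parameters $\theta_n$ (here $q_{\theta_n}(\sigma_n=+1)=\mathrm{sigmoid}(\theta_n)$):

```python
import itertools
import numpy as np
from scipy.optimize import minimize

# Made-up small Ising chain: E = sum_n h_n s_n + sum_{m<n} J_mn s_m s_n
N, beta = 8, 1.0
h = -0.3 * np.ones(N)
J = -1.0 * np.eye(N, k=1)                     # nearest-neighbour couplings

def energy(s):
    s = np.asarray(s)
    return h @ s + s @ J @ s

# Exact free energy F = -T log Z by brute-force enumeration (small N only)
Z = sum(np.exp(-beta * energy(s)) for s in itertools.product([-1, 1], repeat=N))
F_exact = -np.log(Z) / beta

def variational_free_energy(theta):
    """E_q[E] - T H[q] for a factorized q with q(s_n = +1) = sigmoid(theta_n)."""
    p = np.clip(1.0 / (1.0 + np.exp(-theta)), 1e-12, 1 - 1e-12)  # clip for numerical safety
    m = 2 * p - 1                             # site magnetizations <s_n>_q
    E_q = h @ m + m @ J @ m                   # energy expectation factorizes under q
    H_q = -np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    return E_q - H_q / beta

res = minimize(variational_free_energy, x0=np.zeros(N))
print("exact F     :", F_exact)
print("mean-field F:", res.fun, "(never below exact F: Gibbs inequality)")
```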
Just need to replace the Boltzmann distribution with $$ p(z|x) =\frac{p(x,z)}{p_\text{M}(x)}. $$ (we add the subscript “M” for model)
The role of the spins $\sigma$ is now played by the latent variables $z$
Following same steps leads us to
$$ \log p_\text{M}(x) \geq \E_{z\sim q_\theta(\cdot|x)}\left[\log p(x,z)\right]+ H[q_\theta(\cdot|x)]. $$
The RHS is the evidence lower bound or ELBO (the marginal probability $p_\text{M}(x)$ on the left is sometimes called the model evidence).
Possible to rewrite as $$ \log p_\text{M}(x) \geq \log p_\text{M}(x) - D_\text{KL}(q_\theta(\cdot|x)||p(\cdot|x)), $$ so the bound is saturated when the variational posterior for the latent variables coincides with the true posterior $$ p(z|x)=p(x,z)/p_\text{M}(x) $$
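To see the bound and its saturation numerically, here's a sketch for a toy model where everything is Gaussian, so both $\log p_\text{M}(x)$ and the true posterior are available in closed form: $z\sim\mathcal{N}(0,1)$, $x|z\sim\mathcal{N}(z,s^2)$, with a Gaussian variational posterior $q_\theta(z|x)=\mathcal{N}(\mu_q,\sigma_q^2)$ and the expectation estimated by Monte Carlo:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)

# Made-up model: z ~ N(0,1), x | z ~ N(z, s^2); everything Gaussian,
# so log p_M(x) and the true posterior p(z|x) are known in closed form
s = 0.5
x = 1.3                                       # a single observed data point
log_evidence = norm.logpdf(x, loc=0.0, scale=np.sqrt(1 + s**2))

def elbo(mu_q, sigma_q, n_samples=100_000):
    """Monte Carlo estimate of E_{z~q}[log p(x,z)] + H[q] for q = N(mu_q, sigma_q^2)."""
    z = rng.normal(mu_q, sigma_q, size=n_samples)
    log_joint = norm.logpdf(x, loc=z, scale=s) + norm.logpdf(z, loc=0.0, scale=1.0)
    entropy_q = 0.5 * np.log(2 * np.pi * np.e * sigma_q**2)
    return log_joint.mean() + entropy_q

# True posterior: N(x/(1+s^2), s^2/(1+s^2)) -- the bound should be (nearly) saturated
mu_post, var_post = x / (1 + s**2), s**2 / (1 + s**2)
print("log p_M(x)            :", log_evidence)
print("ELBO at true posterior:", elbo(mu_post, np.sqrt(var_post)))
print("ELBO at a bad q       :", elbo(0.0, 1.0))   # strictly below log p_M(x)
```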