$$\nonumber \newcommand{\br}{\mathbf{r}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bk}{\mathbf{k}} \newcommand{\bq}{\mathbf{q}} \newcommand{\bv}{\mathbf{v}} \newcommand{\bx}{\mathbf{x}} \newcommand{\bz}{\mathbf{z}}$$

## Outline

• Hamiltonians in physics
• ML background
• Learning dynamics
• Learning canonical transformations

## Hamiltonians in Physics

### Hamilton’s Equations

• Newton’s law 2nd order in $\mathbb{R}^N$

\begin{aligned} \dot \bq &=\frac{\partial H}{\partial \bp}\\ \dot \bp &=-\frac{\partial H}{\partial \bq} \end{aligned}

• 1st order equations in phase space $\mathbb{R}^{2N}$
• Flow along vector field $\bv(\bq,\bp)=(\dot\bq,\dot\bp)$.
• Trajectories are integral curves of $\bv(\bq,\bp)$.

### Simple Harmonic Motion

• Newton tells us $\ddot x = - \omega^2 x$

• General solution $x(t) = A\cos \omega t + B \sin \omega t$

### Phase Space

• One 2nd order equation $\longrightarrow$ two 1st order equations

\begin{aligned} \dot x &= p\\ \dot p &= -x \end{aligned}

• Observe energy $E=\frac{1}{2}\left[p^2+x^2\right]$ constant

### Pendulum

• Newton says

$$\ddot \theta = -\frac{g}{l}\sin\theta$$

### Pendulum in Phase Space

• $\theta\to x$

\begin{aligned} \dot x &= p\\ \dot p &= -\sin x \end{aligned}

### Hamilton’s equations

• For $(\mathbf{q}, \mathbf{p})\in\mathbb{R}^{2n}$, Hamiltonian $H:\mathbb{R}^{2n}\to \mathbb{R}$ defines dynamics via

\begin{aligned} \dot \bq &= \frac{\partial H}{\partial \bp}\\ \dot \bp &= -\frac{\partial H}{\partial \bq} \end{aligned}

• The examples we’ve seen so far have

$$H(\mathbf{q},\mathbf{p}) = \frac{1}{2}\mathbf{p}^2 + V(\mathbf{q})$$

• And $H$ is the energy

### Geometrical Meaning

• Phase plane velocity $(\dot\bq,\dot\bp)$ perpendicular to $\nabla H$ ($H$ conserved)

$$(\dot\bq,\dot\bp)\cdot (\nabla_\bq H, \nabla_\bp H) = (\nabla_\bp H,-\nabla_\bq H)\cdot (\nabla_\bq H, \nabla_\bp H) = 0$$

• Velocity is divergenceless

$$\nabla\cdot\bv=\frac{\partial \bv_\bq}{\partial \bq}+\frac{\partial \bv_\bp}{\partial \bp}=\frac{\partial^2 H}{\partial \bq\partial \bp}-\frac{\partial^2 H}{\partial \bp\partial \bq} = 0$$

• Flow preserves volume

### Chaos

• For $N=1$ motion on the contours of $H$ fixes trajectories

• For $N\geq 2$ visualize by taking a PoincarĂ© section

• Generically regions of regular and chaotic motion will occur

### Canonical Transformations

• Write $x=(\mathbf{q}, \mathbf{p})\in \mathbb{R}^{2N}$

• Hamiltonian evolution on $t\in[0,T]$ with $x’=x(T)$, $x=x(0)$

• Example of canonical (symplectic) transformation. What’s special?

### Hamilton’s equations

\begin{aligned} \dot \bq &= \frac{\partial H}{\partial \bp}\\ \dot \bp &= -\frac{\partial H}{\partial \bq}\\ \dot{x} &= \Omega \nabla_x H\, ,\quad \Omega = \begin{pmatrix} 0 & \mathbb{1}_n\\ -\mathbb{1}_n & 0 \end{pmatrix} \end{aligned}

• Hamilton’s equations $\longrightarrow$ Jacobian $J=\partial x’/\partial x$ satisfies

$$J^T\Omega J_f = \Omega$$

• $J(x)\in\text{Sp}_{2n}(\mathbb{R})$ is member of linear symplectic group

• Simplest case $n=1$

$$J = \begin{pmatrix} a & b \\ c & d \end{pmatrix}\longrightarrow ad-bc=1$$

• $J\in SL(2,\mathbb{R})$. Rotate / shear / squeeze

$$J^T\Omega J_f = \Omega$$

• Since $\det(J_f) = +1$ $\longrightarrow$ volume is conserved

• More: the sum of (signed) areas in each $q_j-p_j$ plane is preserved

• Canonical transformations preserve form of Hamilton’s equations

## ML background

### Generative Models

Sample from a multivariate distribution

### Default approach in physics

$$P(\{\sigma_i\}) = Z^{-1} \exp\left(-\beta\sum_{i,j}J_{ij}\sigma_i\sigma_j\right)$$

Use MCMC to sample from distribution, calculate expectations, etc.

### The Idea (Rezende & Mohamed, 2015)

• Given bijection $\mathbf{f}:\mathbf{x}\mapsto\mathbf{z}$ we have $$p_\mathbf{X}(\mathbf{x}) = p_{\mathbf{Z}}(f(\mathbf{x}))\left|\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right|$$
• Rich $\mathbf{f}$ can yield complex $p_\mathbf{X}$ from simple $p_{\mathbf{Z}}$ (e.g. Gaussian)
• Parameterize $\mathbf{f}$ with (deep) neural network
• Compose many bijectors $\mathbf{f}_L\circ \mathbf{f}_{L-1}\cdots \circ \mathbf{f}_1$
• Train by maximizing log-likelihood of data

### Application 1: Generation

• Sample: $\mathbf{x}=\mathbf{f}^{-1}(\mathbf{z})$ for $\mathbf{z}\sim p_{\mathbf{Z}}$ (requires invertible $\mathbf{f}$)

### Application 2: Density Estimation

• Given $\mathbf{x}$ find $\log p_X(\mathbf{x})$

### The Challenge

$$p_\mathbf{X}(\mathbf{x}) = p_{\mathbf{Z}}(f(\mathbf{x}))\left|\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right|$$

• For $\mathbf{x},\mathbf{x}\in\mathbb{R}^D$ computation of determinant is $O(D^3)$
• Prohibitive for training of large models.

### Two Solutions

1. Choose $\mathbf{f}_j$ to have tractable jacobian
2. Continuous limit

$$\mathbf{f}(\mathbf{x}) = \mathbf{x}+\epsilon \mathbf{g}(\mathbf{x})$$ $$\left|\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right|\sim 1 +\epsilon \textrm{tr}\left[\frac{\partial \mathbf{g}}{\partial \mathbf{x}}\right]$$

(Chen _et al._, [arXiv:1806.07366](https://arxiv.org/abs/1806.07366) )
• Let’s look at some examples of the first approach

### Example 1: Real NVP (Dinh et al, 2016)

• Divide the variables $\mathbf{x}$ into two groups $x_{1:d}$ and $x_{d+1:D}$

\begin{aligned} z_j &= x_j e^{\alpha_j(x_{d+1:D})} + \mu_j(x_{d+1:D}), \qquad j=1,\ldots, d \\ z_j &= x_j \qquad j=d+1,\ldots, D\\ \left|\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right| &= \prod_{j=1}^d e^{\alpha_i(x_{d+1:D})} \end{aligned}

• Parameterize scale $e^{\alpha_j(x_{d+1:D})}$ and shift $\mu_j(x_{d+1:D})$ by NN

• Compose many bijections

1. Alternating between two sets of variables, or
2. Linear orthogonal transformations (c.f. Glow)

### Example 2: Autoregressive models

• Exploit chain rule of probability: $p(\mathbf{x}) = \prod_j p(x_j|x_{1:x_{j-1}})$

$$x_j = z_j e^{\alpha_j(x_{1:j-1})} + \mu_j(x_{1:j-1})$$

## Learn dynamics

### Loss function

$$\operatorname*{argmin}_\theta \bigg \Vert \frac{d\mathbf{q}}{dt} - \frac{\partial \mathcal{H_{\theta}}}{\partial \mathbf{p}} \bigg \Vert^2 + \bigg \Vert \frac{d\mathbf{p}}{dt} + \frac{\partial \mathcal{H_{\theta}}}{\partial \mathbf{q}} \bigg \Vert^2$$

• $\mathcal{H_{\theta}}(\bq,\bp)$ parameterized by NN; gradients from AD

### What if you just had an image of the motion?

• Map pairs of images $\left[\bx_{t}, \bx_{t+1} \right]$ to a “latent” phase space $\bz=\left[q_t,p_t\right]$
• Correct dimensionality assumed known

• Toth et al. arXiv:1909.13789 refinement: model the initial point in phase space by an encoder network $\bz\sim q(\cdot|\bx_0,\ldots,\bx_T)$

## Neural Canonical Transformations

• Learning Symmetries of Classical Integrable Systems, Roberto Bondesan & AL arXiv:1906.04645
• Neural Canonical Transformation with Symplectic Flows, Li et al. arXiv:1910.00024

### Generating Functions

• ‘Type 2’ canonical transformation

$$\mathbf{q}'=\mathbf{q}, \qquad \mathbf{p}'=\mathbf{p}-\nabla F(\mathbf{q})$$

• c.f. leapfrog, real NVP

$$z_j = x_j e^{\alpha_j(x_{d+1:D})} + \mu_j(x_{d+1:D}), \qquad j=1,\ldots, d$$

• Differences:
1. Only canonical if $\partial_j \mu_k = \partial_k \mu_j$
2. No scale (‘NICE’)

### Parameterizing shift

1. $F(\mathbf{q})\sim \textsf{NN}(\mathbf{q})$ and get $\nabla F(\mathbf{q})$ from autodiff
__Problem__: $O(m^2)$ in training network of $m$ layers.

1. Irrotational MLP: $\partial_j \mu_k = \partial_k \mu_j$ if

$$\mu(\mathbf{q}) = W_1^T\sigma(W_2 \sigma(W_1\mathbf{q})), \qquad W_2 \text{ diagonal}$$

• $\mathbf{q}$ unchanged. More complicated transformations?

### Linear layers

• Linear layer to mix $p,q$ so that deeper additive couplings act on all phase space coordinates (c.f. Glow)

• To parametrize $S\in \text{Sp}_{2n}(\mathbb{R})$ \begin{aligned} S = NAK= \begin{pmatrix} \mathbf{1} & 0 \\ M & \mathbf{1} \end{pmatrix} \begin{pmatrix} L^\top & 0 \\ 0 & L^{-1} \end{pmatrix} \begin{pmatrix} X & -Y \\ Y & X \end{pmatrix} \, , \end{aligned} with \begin{aligned} &M = M^\top \,,\quad &X^\top Y = Y^\top X \, ,\quad X^\top X + Y^\top Y = \mathbf{1}\,. \end{aligned}

$$K = \begin{pmatrix} X & -Y \ Y & X \end{pmatrix} , ,$$

• Write in terms of unitary $X + iY$ as product of Householder reflections $$R_v = \mathbb{1} - 2 \frac{v v^\dagger}{||v||^2}\in {\rm U}_n,,$$ and a diagonal matrix of phases

$$U = \text{diag}(e^{i\phi_i}),.$$

### Zero Center (c.f. Batch Norm)

\begin{aligned} \begin{cases} Q = q - \mu^q + \alpha\ P = p - \mu^p + \beta \end{cases}, , \end{aligned}

• Training: $\mu$ is batch mean during training

• Testing: weighted moving average accumulated during training

• Full version of batch norm not canonical

### Liouville-Arnold Theorem

• Integrable means $N$ conserved phase space functions $I_j$:

• Canonical transformations generated by each (as Hamiltonian) commute

• Submanifold of phase space at fixed $\left\{I_{i}\right\}$ is $N$-Torus $\mathbb{T}^{N}$

### Transformed Hamiltonian

• Transformation $\mathcal{T}$ to action-angle coordinates,

\begin{aligned} \dot{\varphi} = \partial_IK = \text{const.}\,,\quad \dot{I} = -\partial_\varphi K = 0 \end{aligned}

• The transformed Hamiltonian $K = H \circ \mathcal{T}$ is independent of the angles

### Example: Kepler Hamiltonian

$$H_{\text{K}} = \tfrac{1}{2}\sum p_i^2 + \frac{k}{r},, \quad r = \sqrt{\sum q_i^2}.$$

• $H_\text{K}$ and angular momentum $\mathbf{L}=\mathbf{q}\times\mathbf{p}$ conserved

• Additional conserved quantity Laplace–Runge–Lenz vector $$\mathbf{A} = \mathbf{p}\times\mathbf{L} + k \frac{\mathbf{q}}{r}.$$

• Total of 7 conserved quantities. But phase space only 6D!

• Two relations

\begin{aligned} \mathbf{A}\cdot \mathbf{L}&=0\qquad \mathbf{A}^2 &= k^2 + 2H_\text{K} \mathbf{L}^2 \end{aligned}

• Trajectories close. This is called superintegrability

### Loss function

• Learn the canonical transformation $T$

$$T : (\hat{q}, \hat{p}) \mapsto (q, p) ,.$$

\begin{aligned} \ell = \frac{1}{n \tau} \sum_{k=1}^{\tau} || r_{k} - r_{k+1} ||^2\,,\quad r_{k} = \hat{q}(t_k)^2 + \hat{p}(t_k)^2 \,. \end{aligned}

• Not canonical: could minimize loss by collapsing the trajectories
1. Find trajectories from equations of motions (RK)

3. Shuffle the trajectories at every epoch

### Outlook

• Continuous canonical transformations (c.f. neural ODE)

• Convolutional bijectors, identical particles. etc.

• See Neural Canonical Transformation with Symplectic Flows, Li et al. arXiv:1910.00024 for recent developments

### Fermi–Pasta–Ulam–Tsingou problem

$$H = \sum_{i=1}^N \frac{1}{2} [p_i^2 + (q_{i} - q_{i+1})^2] + \frac{\alpha}{3} (q_{i} - q_{i+1})^3 + \frac{\beta}{4} (q_{i} - q_{i+1})^4$$`

### Transform Normal Modes

• $N=5$ sites, 4 nonzero modes

â€‹