Standard Recurrent Neural Networks (RNNs) and their variants, such as LSTMs and GRUs, operate on discrete sequences of data. They update their hidden state once per step, which implicitly assumes evenly spaced observations and is restrictive for modeling continuous-time processes or irregularly sampled data. Neural Ordinary Differential Equations (Neural ODEs) propose a paradigm shift: instead of defining a discrete transition function, we model the continuous-time dynamics of a system's hidden state, $\mathbf{z}(t)$, using a neural network.
The core idea is to parameterize the derivative of the hidden state with respect to time, $\frac{d\mathbf{z}(t)}{dt}$, using a neural network, denoted as $f_\theta$. This transforms the problem of learning a sequence-to-sequence mapping into learning the vector field of a dynamical system.
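As a concrete illustration, $f_\theta$ can be as simple as a small multilayer perceptron that takes the current state and time and returns a vector with the same dimension as the state. The PyTorch module below is a minimal sketch of this idea; the class name `ODEFunc`, the layer widths, and the choice of feeding $t$ in as an extra input feature are illustrative assumptions rather than part of the method itself.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """A small MLP that parameterizes the vector field dz/dt = f_theta(z, t)."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden),  # +1 input feature for the time t
            nn.Tanh(),
            nn.Linear(hidden, dim),      # output matches the state dimension D
        )

    def forward(self, z: torch.Tensor, t: float) -> torch.Tensor:
        # Broadcast the scalar time across the batch and append it to the state,
        # so the learned dynamics can depend on t as well as z.
        t_col = torch.full((z.shape[0], 1), float(t), dtype=z.dtype, device=z.device)
        return self.net(torch.cat([z, t_col], dim=-1))
```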
Let the state of a system at time $t$ be represented by a vector $\mathbf{z}(t) \in \mathbb{R}^D$. A Neural ODE defines the dynamics of this state via the following initial value problem (IVP):
$$
\frac{d\mathbf{z}(t)}{dt} = f_\theta(\mathbf{z}(t), t)
$$
where $f_\theta$ is a neural network with parameters $\theta$. Given an initial state $\mathbf{z}(t_0)$, the state at any later time $t_1$ can be found by solving this ODE:
$$
\mathbf{z}(t_1) = \mathbf{z}(t_0) + \int_{t_0}^{t_1} f_\theta(\mathbf{z}(t), t) dt
$$
This integration is performed by a numerical ODE solver, such as Euler's method or the more accurate Runge-Kutta methods (e.g., RK4). The entire process, from $\mathbf{z}(t_0)$ to $\mathbf{z}(t_1)$, can be viewed as a single continuous-depth residual network layer, which we can call an ODESolve function.
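To make the ODESolve step concrete, here is a minimal fixed-step solver sketch, assuming the `ODEFunc` module above; the function name `odesolve` and the default step count are illustrative. A single Euler step, $\mathbf{z} \leftarrow \mathbf{z} + h\, f_\theta(\mathbf{z}, t)$, has exactly the form of a residual-block update, which is what motivates the continuous-depth view; production implementations typically use adaptive-step solvers instead.

```python
def odesolve(func, z0, t0, t1, n_steps=100, method="rk4"):
    """Integrate dz/dt = func(z, t) from t0 to t1 with a fixed-step solver.

    `func` is any callable mapping (z, t) -> dz/dt, e.g. the ODEFunc sketch above.
    """
    h = (t1 - t0) / n_steps
    z, t = z0, t0
    for _ in range(n_steps):
        if method == "euler":
            # Forward Euler: the same form as a residual-block update.
            z = z + h * func(z, t)
        elif method == "rk4":
            # Classical 4th-order Runge-Kutta step.
            k1 = func(z, t)
            k2 = func(z + 0.5 * h * k1, t + 0.5 * h)
            k3 = func(z + 0.5 * h * k2, t + 0.5 * h)
            k4 = func(z + h * k3, t + h)
            z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        else:
            raise ValueError(f"unknown method: {method}")
        t = t + h
    return z
```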
Training the network, that is, finding the optimal parameters $\theta$, requires backpropagating through the ODE solver. A naive approach, unrolling the solver's operations and storing all intermediate states for the backward pass, quickly becomes expensive, especially for long time horizons or high-precision solvers that take many steps.
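For reference, the naive strategy is just ordinary reverse-mode autodiff through every solver step. With the sketches above, calling `.backward()` on a loss computed from the output of `odesolve` keeps every intermediate state in the autograd graph, so memory grows linearly with the number of steps. A toy training step under those assumptions:

```python
func = ODEFunc(dim=2)
optimizer = torch.optim.Adam(func.parameters(), lr=1e-3)

z0 = torch.randn(32, 2)       # a batch of (toy) initial states
target = torch.randn(32, 2)   # a (toy) regression target for z(t1)

z1 = odesolve(func, z0, t0=0.0, t1=1.0, n_steps=100)  # graph spans all 100 steps
loss = ((z1 - target) ** 2).mean()

optimizer.zero_grad()
loss.backward()               # differentiates through every intermediate state
optimizer.step()
```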
The Adjoint Method provides an elegant and memory-efficient solution. It avoids the need for backpropagation through the solver's internal steps. We define a loss function $L$ that depends on the final state, $L(\mathbf{z}(t_1))$. The key is to compute the gradient of the loss with respect to the parameters, $\frac{dL}{d\theta}$, without explicitly differentiating the ODESolve function.
The adjoint state, $\mathbf{a}(t) = \frac{dL}{d\mathbf{z}(t)}$, represents how the loss changes with respect to the hidden state at time $t$. Its dynamics are given by another ODE:
$$
\frac{d\mathbf{a}(t)}{dt} = -\mathbf{a}(t)^T \frac{\partial f_\theta(\mathbf{z}(t), t)}{\partial \mathbf{z}}
$$
This adjoint ODE is solved backward in time, from $t_1$ to $t_0$, starting from the condition $\mathbf{a}(t_1) = \frac{\partial L}{\partial \mathbf{z}(t_1)}$. The gradient of the loss with respect to the parameters $\theta$ is then obtained from a third integral, also evaluated backward in time:
$$
\frac{dL}{d\theta} = -\int_{t_1}^{t_0} \mathbf{a}(t)^T \frac{\partial f_\theta(\mathbf{z}(t), t)}{\partial \theta} \, dt
$$
Because only the current values of $\mathbf{z}(t)$, $\mathbf{a}(t)$, and the accumulated gradient need to be carried along, this method has a constant memory cost with respect to the number of integration steps, making it highly scalable.
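The sketch below shows one way this backward sweep can be implemented on top of the `ODEFunc` and `odesolve` sketches above. It re-integrates $\mathbf{z}(t)$ backward in time alongside the adjoint $\mathbf{a}(t)$ and the accumulated parameter gradient, obtaining the $\mathbf{a}^T \partial f_\theta / \partial \mathbf{z}$ and $\mathbf{a}^T \partial f_\theta / \partial \theta$ terms as vector-Jacobian products from `torch.autograd.grad`. Fixed-step Euler and the function name `adjoint_backward` are illustrative simplifications; the original formulation solves a single augmented ODE with the same (possibly adaptive) solver used in the forward pass.

```python
def adjoint_backward(func, z1, a1, t0, t1, n_steps=100):
    """Backward sweep of the adjoint method (fixed-step Euler sketch).

    Given z1 = z(t1) and a1 = dL/dz(t1), integrate z, a, and the parameter
    gradient backward from t1 to t0. Only the current z and a are stored,
    so memory is constant in the number of steps.
    """
    params = [p for p in func.parameters() if p.requires_grad]
    grad_theta = [torch.zeros_like(p) for p in params]

    h = (t1 - t0) / n_steps
    z, a, t = z1.detach(), a1.detach(), t1
    for _ in range(n_steps):
        with torch.enable_grad():
            z_req = z.requires_grad_(True)
            f_val = func(z_req, t)
            # Vector-Jacobian products a^T df/dz and a^T df/dtheta.
            vjps = torch.autograd.grad(f_val, [z_req] + params,
                                       grad_outputs=a, allow_unused=True)
        vjp_z, vjp_params = vjps[0], vjps[1:]

        # One Euler step of size -h for each backward ODE:
        #   dz/dt    =  f(z, t)
        #   da/dt    = -a^T df/dz
        #   da_th/dt = -a^T df/dtheta,  with dL/dtheta = a_th(t0)
        z = (z - h * f_val).detach()
        a = (a + h * vjp_z).detach()
        for g, vjp in zip(grad_theta, vjp_params):
            if vjp is not None:
                g.add_(h * vjp)
        t = t - h

    return a, grad_theta  # a is now dL/dz(t0); grad_theta holds dL/dtheta
```

To train with this routine, one would compute $\mathbf{a}(t_1) = \frac{\partial L}{\partial \mathbf{z}(t_1)}$ from the loss, call it, and copy each entry of `grad_theta` into the corresponding parameter's `.grad` before the optimizer step. Re-integrating $\mathbf{z}(t)$ backward can drift from the forward trajectory for stiff dynamics, which is one reason solver choice and error tolerances matter in practice.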
This interactive demo lets you switch between standard backpropagation through the unrolled solver and the more efficient continuous adjoint method, and explore the behavior of Neural ODEs on various dynamical systems.