A small MLP parametrises the right-hand side of an ODE and is integrated end-to-end with a differentiable Dormand-Prince solver. The network learns the vector field of a damped pendulum and a Lotka-Volterra system directly from short noisy trajectories, then is rolled forward beyond the training horizon to test extrapolation.
A classical neural network learns a discrete map x → F(x). A Neural ODE (Chen et al., NeurIPS 2018) learns the derivative dy/dt = fθ(y, t) instead and integrates it with a numerical solver, making the entire forward pass differentiable through the adjoint method or direct backprop through every solver step.
This project trains the model on a short noisy trajectory of two textbook 2D systems — a damped pendulum and a Lotka-Volterra predator-prey model — and then rolls the learned ODE forward past the time window it was trained on. The natural question is whether it still behaves like the real system, including derived quantities the network was never asked to match (mechanical energy, the Lotka-Volterra invariant H).
Everything runs in a single command: python main.py trains both systems back-to-back and writes all figures plus a JSON metrics summary.
Neural ODE: dy/dt = fθ(y, t), with fθ a small MLP.
Solver-based forward pass: ŷ(tᵢ) = ODESolve(fθ, y₀, t₀ → tᵢ), computed with an adaptive Dormand-Prince solver.
Loss against observations: L(θ) = (1/N) Σᵢ ‖ŷ(tᵢ) − yᵢ‖².
Damped pendulum (true system): dθ/dt = θ̇, dθ̇/dt = −ω₀² sin θ − γθ̇.
Lotka-Volterra invariant: H(x, z) = d·x − c·ln x + b·z − a·ln z, constant along every true orbit.
Seven pieces connected end-to-end. Each lives in its own module with a dedicated responsibility.
Goal: generate ground-truth trajectories with controllable observation noise.
How: scipy.integrate.odeint (LSODA, rtol=1e-8, atol=1e-10) integrates the true ODEs from a chosen initial condition. i.i.d. Gaussian noise is added to the training half (σ = 0.005 for the pendulum, 0.03 for Lotka-Volterra); the rest is held out clean for the extrapolation check.
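A minimal sketch of this step for the pendulum. The function names (`pendulum_rhs`, `make_dataset`) and the parameter values (ω₀, γ, initial condition, time grid) are illustrative placeholders, not the exact API of `src/systems.py`:

```python
import numpy as np
from scipy.integrate import odeint

def pendulum_rhs(y, t, omega0=1.5, gamma=0.1):
    """True damped-pendulum dynamics: dtheta/dt = theta_dot, dtheta_dot/dt = -omega0^2 sin(theta) - gamma theta_dot."""
    theta, theta_dot = y
    return [theta_dot, -omega0**2 * np.sin(theta) - gamma * theta_dot]

def make_dataset(y0, t_full, t_train, sigma=0.005, seed=0):
    """Integrate the true ODE, then add i.i.d. Gaussian noise to the training half only."""
    rng = np.random.default_rng(seed)
    y_true = odeint(pendulum_rhs, y0, t_full, rtol=1e-8, atol=1e-10)
    train_mask = t_full <= t_train
    y_noisy = y_true.copy()
    y_noisy[train_mask] += sigma * rng.standard_normal(y_true[train_mask].shape)
    return y_true, y_noisy, train_mask

t = np.linspace(0.0, 20.0, 400)                       # full horizon [0, 20]
y_true, y_obs, mask = make_dataset([2.0, 0.0], t, t_train=12.0)
```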
Architecture of fθ: 3 hidden layers of width 96 with Tanh activations, output dim 2. 19 106 trainable parameters.
Init trick: the final Linear is rescaled by 0.1 with zero bias, so the initial vector field is nearly zero everywhere — the very first solver call cannot diverge.
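A sketch of the vector-field module with that init trick. It reproduces the stated architecture (3 hidden layers × width 96, Tanh, output dim 2, 19 106 parameters) but the class body is illustrative rather than a copy of `src/ode_net.py`:

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """MLP vector field f_theta(y, t); the system is autonomous, so t is accepted but unused."""
    def __init__(self, dim=2, width=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, dim),
        )
        # Init trick: rescale the final Linear by 0.1 and zero its bias,
        # so the initial vector field is nearly zero and the first solver call cannot diverge.
        with torch.no_grad():
            self.net[-1].weight.mul_(0.1)
            self.net[-1].bias.zero_()

    def forward(self, t, y):
        return self.net(y)
```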
Backend: torchdiffeq.odeint with the Dormand-Prince adaptive RK45 method (dopri5, rtol=1e-5, atol=1e-7).
Every internal solver state is kept on the autograd graph, so L.backward() propagates gradients through all the steps the solver took.
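A minimal sketch of the solver-based forward pass and the loss, using the `torchdiffeq.odeint` call and tolerances named above; the observation tensor here is a stand-in for the real noisy trajectory:

```python
import torch
from torchdiffeq import odeint  # differentiable Dormand-Prince solver

func = ODEFunc()                            # the MLP from the previous sketch
y0 = torch.tensor([2.0, 0.0])               # initial state (placeholder values)
t_obs = torch.linspace(0.0, 12.0, 240)      # training time grid
y_obs = torch.randn(240, 2)                 # stand-in for the noisy observations

pred = odeint(func, y0, t_obs, method="dopri5", rtol=1e-5, atol=1e-7)
loss = torch.mean((pred - y_obs) ** 2)      # MSE against the observed trajectory
loss.backward()                             # gradients flow through every solver step
```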
Optimiser: Adam with cosine-annealed LR (4·10⁻³ for the pendulum, 2·10⁻³ for Lotka-Volterra), gradient clipped to ‖∇‖ ≤ 2.
Curriculum: the matching window starts at 20–30 samples and grows by the same step every 70–100 epochs until it covers the full training trajectory. This avoids the early-training divergence typical of long integrations.
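A sketch of the curriculum loop under the schedule described above (Adam, cosine-annealed LR, gradient clipping, growing window). The concrete epoch count, window step, and learning rate are placeholders, not the exact values in `src/train.py`:

```python
import torch
from torchdiffeq import odeint

def train_curriculum(func, y0, t_obs, y_obs, epochs=2000, lr=4e-3,
                     start_window=30, grow_every=80):
    opt = torch.optim.Adam(func.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    window = start_window
    for epoch in range(epochs):
        # Grow the matching window until it covers the full training trajectory.
        if epoch > 0 and epoch % grow_every == 0:
            window = min(window + start_window, len(t_obs))
        opt.zero_grad()
        pred = odeint(func, y0, t_obs[:window], method="dopri5",
                      rtol=1e-5, atol=1e-7)
        loss = torch.mean((pred - y_obs[:window]) ** 2)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(func.parameters(), max_norm=2.0)
        opt.step()
        sched.step()
    return func
```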
Once trained, the model is integrated over the full horizon [0, Tfull] and the MSE is computed separately on the training mask (t ≤ Ttrain) and the held-out extrapolation mask (t > Ttrain).
The relative extrapolation error is normalised by the peak amplitude of the ground-truth trajectory, not its mean square, which avoids blow-up when the system approaches equilibrium.
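A sketch of the evaluation split and the scale-free relative error, assuming NumPy arrays for the predicted and true trajectories:

```python
import numpy as np

def evaluate(y_pred, y_true, t, t_train):
    """Split MSE into train / extrapolation masks; relative error uses the peak amplitude."""
    train_mask = t <= t_train
    extrap_mask = ~train_mask
    mse_train = np.mean((y_pred[train_mask] - y_true[train_mask]) ** 2)
    mse_extrap = np.mean((y_pred[extrap_mask] - y_true[extrap_mask]) ** 2)
    # Normalise by the peak amplitude of the ground truth, not its mean square,
    # so the metric does not blow up as the system approaches equilibrium.
    rel_extrap = np.sqrt(mse_extrap) / np.max(np.abs(y_true))
    return {"mse_train": mse_train, "mse_extrap": mse_extrap, "rel_extrap": rel_extrap}
```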
Two structural quantities are evaluated on the predicted trajectory: mechanical energy E = ½θ̇² + ω₀²(1−cos θ) for the pendulum, and the Lotka-Volterra invariant H along the orbit.
The MLP is never told these exist. How well they are reproduced is a strict test of how much structure the network rediscovers from the data alone.
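The two invariants, written directly from the formulas above; the Lotka-Volterra parameter order (a, b, c, d) follows the convention used later in the HNN section:

```python
import numpy as np

def pendulum_energy(theta, theta_dot, omega0):
    """Mechanical energy E = 1/2 theta_dot^2 + omega0^2 (1 - cos theta)."""
    return 0.5 * theta_dot**2 + omega0**2 * (1.0 - np.cos(theta))

def lotka_volterra_H(x, z, a, b, c, d):
    """Invariant H = d x - c ln x + b z - a ln z, constant along every true orbit."""
    return d * x - c * np.log(x) + b * z - a * np.log(z)

def relative_drift(values):
    """std(H)/<H>: how far an invariant wanders along a predicted orbit."""
    return np.std(values) / np.abs(np.mean(values))
```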
Seven PNGs auto-saved by src/plots.py: training loss, time series with extrapolation band, phase plane on top of the learned streamplot, and the energy decay panel for the pendulum. A data/metrics.json file collects every reported number for both systems.
Each skill appears in a concrete piece of code, not as a generic claim.
- nn.Module for the vector field, training loop with Adam, cosine LR schedule, gradient clipping.
- torchdiffeq.odeint with Dormand-Prince and adaptive tolerances, autograd through every solver step.
- scipy.integrate.odeint on the true ODE plus controllable Gaussian observation noise.
- H evaluated on the predicted trajectory and compared against the ground truth.
- python main.py runs the full pipeline end-to-end and writes every figure plus a JSON metrics summary.

Repository layout:

    02-neural-odes/
    ├── README.md           # step-by-step walk-through with embedded figures
    ├── requirements.txt
    ├── main.py             # end-to-end pipeline (both systems)
    ├── src/
    │   ├── systems.py      # pendulum + Lotka-Volterra, noise injection, invariants
    │   ├── ode_net.py      # ODEFunc MLP + torchdiffeq integration
    │   ├── train.py        # curriculum-aware training loop + evaluation
    │   └── plots.py        # dark portfolio-styled figures
    ├── figures/            # tracked PNGs embedded in README
    └── data/               # (gitignored) metrics.json summary
Highlights:
- ODEFunc MLP (3 hidden layers · width 96 · Tanh) with small final-layer init for stable warm-up
- Structural-invariant checks on the predicted trajectories (mechanical energy, Lotka-Volterra H)
- A single command (python main.py) writes all figures and a JSON metrics summary

Results · damped pendulum: trained on t ∈ [0, 12], evaluated on the full [0, 20] horizon.
Pendulum figure panels:
- Model: fθ · 3 hidden layers × width 96.
- Time series: the shaded band marks the extrapolation region t > 12.
- Energy drift at t = 20 · interpreted below: the learned ω₀ and γ are slightly off the true values.
- Learned vector field: ‖fθ‖ shown as a magma heatmap and direction shown by overlaid arrows. The amber dot marks the initial state at t = 0; the square / diamond mark the final position of the true / predicted orbit. Two features the network had to recover: the inward spiral towards (0, 0) from energy dissipation, and the non-linear restoring term −ω₀² sin θ visible as the almond shape of the orbit.
- Energy panel: the network captures the period and phase but only an approximate decay rate γ. The 137 % drift is a single-point snapshot at t = 20; the curve as a whole is the more honest summary.

Results · Lotka-Volterra: trained on t ∈ [0, 12], evaluated on the full [0, 22] horizon. Closed orbits make this a strictly harder benchmark — small phase errors compound across each period, and the orbit must remain bounded despite never being told the dynamics are conservative.
Lotka-Volterra figure panels:
- Model: fθ · same architecture as the pendulum.
- Relative extrapolation error for t > 12 · a scale-free measure of the fit.
- H along the predicted orbit · structural sanity check.

The raw MSE scales with ‖y‖², not with the dimensionless quality of the fit; the relative extrapolation error of 1.6 % is the scale-free version. Any mismatch with the (a, b, c, d) parameters of the true system would manifest as a slow inward or outward drift — quantified by the 7.9 % standard deviation of H along the predicted trajectory.

The unconstrained MLP recovers H to within ≈ 8 % by absorbing the structure into 19 106 free parameters. A natural question is whether encoding the Hamiltonian structure into the architecture itself shrinks that drift — and by how much. Lotka-Volterra is canonically Hamiltonian in log-coordinates (q, p) = (ln x, ln z) with H(q, p) = b·exp(p) + d·exp(q) − a·p − c·q; an HNN parametrises a scalar Hθ and computes the dynamics as the symplectic gradient via autograd, so every predicted orbit preserves Hθ by construction.
Same width and depth as the baseline MLP — 3 hidden layers · width 96 · Tanh — but output dimension 1 (the scalar Hamiltonian). 19 009 trainable parameters, essentially the same budget as the baseline (19 106).
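A sketch of such an HNN vector field in the log-coordinates (q, p): a scalar Hθ network of the stated size whose dynamics are its symplectic gradient, obtained via autograd. The class name and details are illustrative, not the project's exact code:

```python
import torch
import torch.nn as nn

class HamiltonianFunc(nn.Module):
    """Scalar H_theta(q, p); the dynamics are its symplectic gradient, so H_theta is conserved by construction."""
    def __init__(self, width=96):
        super().__init__()
        self.H = nn.Sequential(
            nn.Linear(2, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, 1),          # output dim 1: the scalar Hamiltonian
        )

    def forward(self, t, y):
        # y = (q, p); compute dH/dy with autograd, then rotate it:
        # dq/dt = dH/dp, dp/dt = -dH/dq.
        with torch.enable_grad():
            y = y.requires_grad_(True)
            H = self.H(y).sum()
            dH, = torch.autograd.grad(H, y, create_graph=True)
        dHdq, dHdp = dH[..., 0:1], dH[..., 1:2]
        return torch.cat([dHdp, -dHdq], dim=-1)
```

This drops into the same `torchdiffeq.odeint` forward pass as the baseline MLP; only the vector-field module changes.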
Same training schedule: 2 000 epochs, Adam with cosine-annealed lr starting at 2·10⁻³, gradient clipped at 2.0, curriculum window growing every 80 epochs from 30 samples up to 300.
The training signal is MSE on the noisy log-coordinate observations (ln x, ln z). At evaluation time the predicted log-trajectory is exponentiated back to (x, z) for an honest comparison against the same ground truth used by the MLP — so train MSE values are not directly comparable across the two models, but extrapolation error and invariant drift are.
| Quantity | unconstrained MLP | Hamiltonian NN | Reduction |
|---|---|---|---|
| Trainable parameters | 19 106 | 19 009 | — |
| Relative extrapolation error √MSE / max\|y\| | 1.6 % | _filled by run_ | — |
| Drift of H along orbit std(H)/⟨H⟩ | 7.9 % | _filled by run_ | — |
H evaluated point-by-point on each predicted trajectory. The ground truth is flat (cyan reference). The MLP wobbles around it (≈ 7.9 % relative std); the HNN is almost flat — the residual is dominated by dopri5 truncation, not by an architectural inability to conserve. The annotated reduction factor is the headline of this ablation.

The 7.9 % H-drift of the MLP is not a fundamental limit of Neural ODEs on this problem — it is a consequence of the unconstrained architecture. The same training data, schedule and parameter budget under a Hamiltonian prior shrinks the drift by an order of magnitude.
The HNN is only applicable when the system is known to be Hamiltonian and one knows the right canonical coordinates. For dissipative systems — like the damped pendulum of Section "Results · damped pendulum" — a plain HNN would force conservation that does not hold and would extrapolate worse, not better. The natural next step there is a dissipation-aware split flow (Sosanya & Greydanus, 2022), listed as future work.
Why accuracy or a confusion matrix isn't the right metric here, which numbers do apply, and what the headline figures actually mean.
This is not classification, and it is not the usual flat regression either. The output of the model is a continuous function of state — the vector field of an ODE — and the training signal is a whole trajectory, not a label per sample. The relevant question is no longer "how often is it right" but "does integrating the learned field reproduce the right dynamics, including outside the training window?".
| Problem type | Output | Typical metrics |
|---|---|---|
| Classification | discrete class label | accuracy, precision, recall, F1, ROC-AUC, confusion matrix |
| Pointwise regression | continuous number per sample | MSE, MAE, R², RMSE |
| Vector-field regression (this project) | continuous function — the RHS of an ODE | train/extrapolation MSE, scale-free relative error, structural-invariant drift |
Three quantitative layers plus a qualitative phase-plane check; each answers a different question.
MSE on the held-out clean trajectory restricted to t ≤ Ttrain. A floor of about the observation-noise variance is the best an unconstrained MLP can be expected to reach.
Same MSE on the held-out window t > Ttrain — the part the network was never shown. Reported both as raw MSE and as a scale-free relative error √MSE / max|y| so amplitudes don't dominate the comparison between systems.
Mechanical energy on the pendulum and the Lotka-Volterra invariant H along the predicted orbit. The MLP is never told these exist — drift in either is a direct measure of how much physics the network rediscovered from data alone.
Phase plane with the learned vector field overlaid as a streamplot. A correct fit produces orbits that close on themselves (Lotka-Volterra) or spiral cleanly to the equilibrium (pendulum) — qualitative failures show up here long before they do in scalar metrics.
| Metric | What it measures | Threshold | Result |
|---|---|---|---|
| Pendulum · train MSE | fit on the noisy training horizon | ≲ noise² · 10 = 2.5·10⁻⁴ | 1.4·10⁻⁴ ✓ |
| Pendulum · rel-err extrap | scale-free error past Ttrain | ≤ 10 % | 5.4 % ✓ |
| Pendulum · final \|ΔE/E\| | energy drift at t = Tfull | < 5 % (ideal) · > 100 % (failure) | 137 % — see below |
| Lotka · train MSE | fit on the noisy training horizon | ≲ noise² · 10 = 9·10⁻³ | 5.6·10⁻³ ✓ |
| Lotka · rel-err extrap | scale-free error past Ttrain | ≤ 5 % | 1.6 % ✓ |
| Lotka · drift in H | structural-invariant deviation along orbit | ≤ 10 % | 7.9 % ✓ |
Three observations, in order of how informative the discrepancy is.
Relative extrapolation error of 5.4 % for the pendulum and 1.6 % for Lotka-Volterra. Closed orbits make Lotka strictly harder per period, but the amplitude is preserved across multiple cycles into the held-out window — the orbit stays bounded despite the network never being told the dynamics are conservative.
Inside the training window the predicted energy tracks the true dissipation envelope. Past it, the predicted oscillation keeps its amplitude instead of continuing to decay — the network learns the period and phase but only an approximate γ. The 137 % final-time drift is the snapshot consequence; the qualitative summary is "the envelope is right, the late-time decay rate is not".
The architectural fix is not more training — it is encoding the structure. A Hamiltonian Neural Network (Greydanus et al., 2019) cannot, by construction, drift in H. A dissipation-aware variant does the same for dE/dt. Both are listed as the natural next step in the README.
The headline numbers are what an MLP without physics priors achieves on these two systems. They establish the bar against which any inductive-bias improvement (Hamiltonian, Lagrangian, Symplectic ODE-Net) must show a measurable benefit. The follow-ups in the README close that gap by encoding invariants instead of letting the network rediscover them.