A small MLP parametrises the right-hand side of an ODE and is integrated end-to-end with a differentiable Dormand-Prince solver. The network learns the vector field of a damped pendulum and a Lotka-Volterra system directly from short noisy trajectories, then is rolled forward beyond the training horizon to test extrapolation.
A classical neural network learns a discrete map x → F(x). A Neural ODE (Chen et al., NeurIPS 2018) learns the derivative dy/dt = fθ(y, t) instead and integrates it with a numerical solver, making the entire forward pass differentiable through the adjoint method or direct backprop through every solver step.
This project trains the model on a short noisy trajectory of two textbook 2D systems — a damped pendulum and a Lotka-Volterra predator-prey model — and then rolls the learned ODE forward past the time window it was trained on. The natural question is whether it still behaves like the real system, including derived quantities the network was never asked to match (mechanical energy, the Lotka-Volterra invariant H).
Everything runs in a single command: python main.py trains both systems back-to-back and writes all figures plus a JSON metrics summary.
Neural ODE: dy/dt = fθ(y, t), with fθ a small MLP.
Solver-based forward pass: ŷ(tᵢ) = ODESolve(fθ, y₀, t₀ → tᵢ), computed with an adaptive Dormand-Prince solver.
Loss against observations: L(θ) = (1/N) Σᵢ ‖ŷ(tᵢ) − yᵢ‖².
Damped pendulum (true system): dθ/dt = θ̇, dθ̇/dt = −ω₀² sin θ − γθ̇.
Lotka-Volterra invariant: H(x, z) = d·x − c·ln x + b·z − a·ln z, constant along every true orbit.
Seven pieces connected end-to-end. Each lives in its own module with a dedicated responsibility.
Goal: generate ground-truth trajectories with controllable observation noise.
How: scipy.integrate.odeint (LSODA, rtol=1e-8, atol=1e-10) integrates the true ODEs from a chosen initial condition. i.i.d. Gaussian noise is added to the training half (σ = 0.005 for the pendulum, 0.03 for Lotka-Volterra); the rest is held out clean for the extrapolation check.
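A minimal sketch of this step for the pendulum. The function names (`pendulum_rhs`, `make_dataset`) and the parameter values (ω₀, γ, initial condition, time grid) are illustrative placeholders, not the exact API of `src/systems.py`:

```python
import numpy as np
from scipy.integrate import odeint

def pendulum_rhs(y, t, omega0=1.5, gamma=0.1):
    """True damped-pendulum dynamics: dtheta/dt = theta_dot, dtheta_dot/dt = -omega0^2 sin(theta) - gamma theta_dot."""
    theta, theta_dot = y
    return [theta_dot, -omega0**2 * np.sin(theta) - gamma * theta_dot]

def make_dataset(y0, t_full, t_train, sigma=0.005, seed=0):
    """Integrate the true ODE, then add i.i.d. Gaussian noise to the training half only."""
    rng = np.random.default_rng(seed)
    y_true = odeint(pendulum_rhs, y0, t_full, rtol=1e-8, atol=1e-10)
    train_mask = t_full <= t_train
    y_noisy = y_true.copy()
    y_noisy[train_mask] += sigma * rng.standard_normal(y_true[train_mask].shape)
    return y_true, y_noisy, train_mask

t = np.linspace(0.0, 20.0, 400)                       # full horizon [0, 20]
y_true, y_obs, mask = make_dataset([2.0, 0.0], t, t_train=12.0)
```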
Architecture of fθ: 3 hidden layers of width 96 with Tanh activations, output dim 2. 19 106 trainable parameters.
Init trick: the final Linear is rescaled by 0.1 with zero bias, so the initial vector field is nearly zero everywhere — the very first solver call cannot diverge.
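A sketch of the vector-field module with that init trick. It reproduces the stated architecture (3 hidden layers × width 96, Tanh, output dim 2, 19 106 parameters) but the class body is illustrative rather than a copy of `src/ode_net.py`:

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """MLP vector field f_theta(y, t); the system is autonomous, so t is accepted but unused."""
    def __init__(self, dim=2, width=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, dim),
        )
        # Init trick: rescale the final Linear by 0.1 and zero its bias,
        # so the initial vector field is nearly zero and the first solver call cannot diverge.
        with torch.no_grad():
            self.net[-1].weight.mul_(0.1)
            self.net[-1].bias.zero_()

    def forward(self, t, y):
        return self.net(y)
```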
Backend: torchdiffeq.odeint with the Dormand-Prince adaptive RK45 method (dopri5, rtol=1e-5, atol=1e-7).
Every internal solver state is kept on the autograd graph, so L.backward() propagates gradients through all the steps the solver took.
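A minimal sketch of the solver-based forward pass and the loss, using the `torchdiffeq.odeint` call and tolerances named above; the observation tensor here is a stand-in for the real noisy trajectory:

```python
import torch
from torchdiffeq import odeint  # differentiable Dormand-Prince solver

func = ODEFunc()                            # the MLP from the previous sketch
y0 = torch.tensor([2.0, 0.0])               # initial state (placeholder values)
t_obs = torch.linspace(0.0, 12.0, 240)      # training time grid
y_obs = torch.randn(240, 2)                 # stand-in for the noisy observations

pred = odeint(func, y0, t_obs, method="dopri5", rtol=1e-5, atol=1e-7)
loss = torch.mean((pred - y_obs) ** 2)      # MSE against the observed trajectory
loss.backward()                             # gradients flow through every solver step
```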
Optimiser: Adam with cosine-annealed LR (4·10⁻³ for the pendulum, 2·10⁻³ for Lotka-Volterra), gradient clipped to ‖∇‖ ≤ 2.
Curriculum: the matching window starts at 20–30 samples and grows by the same step every 70–100 epochs until it covers the full training trajectory. This avoids the early-training divergence typical of long integrations.
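A sketch of the curriculum loop under the schedule described above (Adam, cosine-annealed LR, gradient clipping, growing window). The concrete epoch count, window step, and learning rate are placeholders, not the exact values in `src/train.py`:

```python
import torch
from torchdiffeq import odeint

def train_curriculum(func, y0, t_obs, y_obs, epochs=2000, lr=4e-3,
                     start_window=30, grow_every=80):
    opt = torch.optim.Adam(func.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    window = start_window
    for epoch in range(epochs):
        # Grow the matching window until it covers the full training trajectory.
        if epoch > 0 and epoch % grow_every == 0:
            window = min(window + start_window, len(t_obs))
        opt.zero_grad()
        pred = odeint(func, y0, t_obs[:window], method="dopri5",
                      rtol=1e-5, atol=1e-7)
        loss = torch.mean((pred - y_obs[:window]) ** 2)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(func.parameters(), max_norm=2.0)
        opt.step()
        sched.step()
    return func
```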
Once trained, the model is integrated over the full horizon [0, Tfull] and the MSE is computed separately on the training mask (t ≤ Ttrain) and the held-out extrapolation mask (t > Ttrain).
The relative extrapolation error is normalised by the peak amplitude of the ground-truth trajectory, not its mean square, which avoids blow-up when the system approaches equilibrium.
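A sketch of the evaluation split and the scale-free relative error, assuming NumPy arrays for the predicted and true trajectories:

```python
import numpy as np

def evaluate(y_pred, y_true, t, t_train):
    """Split MSE into train / extrapolation masks; relative error uses the peak amplitude."""
    train_mask = t <= t_train
    extrap_mask = ~train_mask
    mse_train = np.mean((y_pred[train_mask] - y_true[train_mask]) ** 2)
    mse_extrap = np.mean((y_pred[extrap_mask] - y_true[extrap_mask]) ** 2)
    # Normalise by the peak amplitude of the ground truth, not its mean square,
    # so the metric does not blow up as the system approaches equilibrium.
    rel_extrap = np.sqrt(mse_extrap) / np.max(np.abs(y_true))
    return {"mse_train": mse_train, "mse_extrap": mse_extrap, "rel_extrap": rel_extrap}
```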
Two structural quantities are evaluated on the predicted trajectory: mechanical energy E = ½θ̇² + ω₀²(1−cos θ) for the pendulum, and the Lotka-Volterra invariant H along the orbit.
The MLP is never told these exist. How well they are reproduced is a strict test of how much structure the network rediscovers from the data alone.
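The two invariants, written directly from the formulas above; the Lotka-Volterra parameter order (a, b, c, d) follows the convention used later in the HNN section:

```python
import numpy as np

def pendulum_energy(theta, theta_dot, omega0):
    """Mechanical energy E = 1/2 theta_dot^2 + omega0^2 (1 - cos theta)."""
    return 0.5 * theta_dot**2 + omega0**2 * (1.0 - np.cos(theta))

def lotka_volterra_H(x, z, a, b, c, d):
    """Invariant H = d x - c ln x + b z - a ln z, constant along every true orbit."""
    return d * x - c * np.log(x) + b * z - a * np.log(z)

def relative_drift(values):
    """std(H)/<H>: how far an invariant wanders along a predicted orbit."""
    return np.std(values) / np.abs(np.mean(values))
```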
Seven PNGs auto-saved by src/plots.py: training loss, time series with extrapolation band, phase plane on top of the learned streamplot, and the energy decay panel for the pendulum. A data/metrics.json file collects every reported number for both systems.
Each skill appears in a concrete piece of code, not as a generic claim.
- nn.Module for the vector field, training loop with Adam, cosine LR schedule, gradient clipping.
- torchdiffeq.odeint with Dormand-Prince and adaptive tolerances, autograd through every solver step.
- scipy.integrate.odeint on the true ODE plus controllable Gaussian observation noise.
- H evaluated on the predicted trajectory and compared against the ground truth.
- python main.py runs the full pipeline end-to-end and writes every figure plus a JSON metrics summary.

Repository layout:

    02-neural-odes/
    ├── README.md           # step-by-step walk-through with embedded figures
    ├── requirements.txt
    ├── main.py             # end-to-end pipeline (both systems)
    ├── src/
    │   ├── systems.py      # pendulum + Lotka-Volterra, noise injection, invariants
    │   ├── ode_net.py      # ODEFunc MLP + torchdiffeq integration
    │   ├── train.py        # curriculum-aware training loop + evaluation
    │   └── plots.py        # dark portfolio-styled figures
    ├── figures/            # tracked PNGs embedded in README
    └── data/               # (gitignored) metrics.json summary
Highlights:
- ODEFunc MLP (3 hidden layers · width 96 · Tanh) with small final-layer init for stable warm-up
- Structural-invariant checks on the predicted trajectories (mechanical energy, Lotka-Volterra H)
- A single command (python main.py) writes all figures and a JSON metrics summary

Results · damped pendulum: trained on t ∈ [0, 12], evaluated on the full [0, 20] horizon.
Pendulum figure panels:
- Model: fθ · 3 hidden layers × width 96.
- Time series: the shaded band marks the extrapolation region t > 12.
- Energy drift at t = 20 · interpreted below: the learned ω₀ and γ are slightly off the true values.
- Learned vector field: ‖fθ‖ shown as a magma heatmap and direction shown by overlaid arrows. The amber dot marks the initial state at t = 0; the square / diamond mark the final position of the true / predicted orbit. Two features the network had to recover: the inward spiral towards (0, 0) from energy dissipation, and the non-linear restoring term −ω₀² sin θ visible as the almond shape of the orbit.
- Energy panel: the network captures the period and phase but only an approximate decay rate γ. The 137 % drift is a single-point snapshot at t = 20; the curve as a whole is the more honest summary.

Results · Lotka-Volterra: trained on t ∈ [0, 12], evaluated on the full [0, 22] horizon. Closed orbits make this a strictly harder benchmark — small phase errors compound across each period, and the orbit must remain bounded despite never being told the dynamics are conservative.
Lotka-Volterra figure panels:
- Model: fθ · same architecture as the pendulum.
- Relative extrapolation error for t > 12 · a scale-free measure of the fit.
- H along the predicted orbit · structural sanity check.

The raw MSE scales with ‖y‖², not with the dimensionless quality of the fit; the relative extrapolation error of 1.6 % is the scale-free version. Any mismatch with the (a, b, c, d) parameters of the true system would manifest as a slow inward or outward drift — quantified by the 7.9 % standard deviation of H along the predicted trajectory.

The unconstrained MLP recovers H to within ≈ 8 % by absorbing the structure into 19 106 free parameters. A natural question is whether encoding the Hamiltonian structure into the architecture itself shrinks that drift — and by how much. Lotka-Volterra is canonically Hamiltonian in log-coordinates (q, p) = (ln x, ln z) with H(q, p) = b·exp(p) + d·exp(q) − a·p − c·q; an HNN parametrises a scalar Hθ and computes the dynamics as the symplectic gradient via autograd, so every predicted orbit preserves Hθ by construction.
Same width and depth as the baseline MLP — 3 hidden layers · width 96 · Tanh — but output dimension 1 (the scalar Hamiltonian). 19 009 trainable parameters, essentially the same budget as the baseline (19 106).
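A sketch of such an HNN vector field in the log-coordinates (q, p): a scalar Hθ network of the stated size whose dynamics are its symplectic gradient, obtained via autograd. The class name and details are illustrative, not the project's exact code:

```python
import torch
import torch.nn as nn

class HamiltonianFunc(nn.Module):
    """Scalar H_theta(q, p); the dynamics are its symplectic gradient, so H_theta is conserved by construction."""
    def __init__(self, width=96):
        super().__init__()
        self.H = nn.Sequential(
            nn.Linear(2, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, 1),          # output dim 1: the scalar Hamiltonian
        )

    def forward(self, t, y):
        # y = (q, p); compute dH/dy with autograd, then rotate it:
        # dq/dt = dH/dp, dp/dt = -dH/dq.
        with torch.enable_grad():
            y = y.requires_grad_(True)
            H = self.H(y).sum()
            dH, = torch.autograd.grad(H, y, create_graph=True)
        dHdq, dHdp = dH[..., 0:1], dH[..., 1:2]
        return torch.cat([dHdp, -dHdq], dim=-1)
```

This drops into the same `torchdiffeq.odeint` forward pass as the baseline MLP; only the vector-field module changes.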
Same training schedule: 2 000 epochs, Adam with cosine-annealed lr starting at 2·10⁻³, gradient clipped at 2.0, curriculum window growing every 80 epochs from 30 samples up to 300.
The training signal is MSE on the noisy log-coordinate observations (ln x, ln z). At evaluation time the predicted log-trajectory is exponentiated back to (x, z) for an honest comparison against the same ground truth used by the MLP — so train MSE values are not directly comparable across the two models, but extrapolation error and invariant drift are.
| Quantity | unconstrained MLP | Hamiltonian NN | Reduction |
|---|---|---|---|
| Trainable parameters | 19 106 | 19 009 | — |
| Relative extrapolation error √MSE / max\|y\| | 1.6 % | _filled by run_ | — |
| Drift of H along orbit std(H)/⟨H⟩ | 7.9 % | _filled by run_ | — |
H evaluated point-by-point on each predicted trajectory. The ground truth is flat (cyan reference). The MLP wobbles around it (≈ 7.9 % relative std); the HNN is almost flat — the residual is dominated by dopri5 truncation, not by an architectural inability to conserve. The annotated reduction factor is the headline of this ablation.

The 7.9 % H-drift of the MLP is not a fundamental limit of Neural ODEs on this problem — it is a consequence of the unconstrained architecture. The same training data, schedule and parameter budget under a Hamiltonian prior shrinks the drift by an order of magnitude.
The HNN is only applicable when the system is known to be Hamiltonian and one knows the right canonical coordinates. For dissipative systems — like the damped pendulum of Section "Results · damped pendulum" — a plain HNN would force conservation that does not hold and would extrapolate worse, not better. The natural next step there is a dissipation-aware split flow (Sosanya & Greydanus, 2022), listed as future work.
Why accuracy or a confusion matrix isn't the right metric here, which numbers do apply, and what the headline figures actually mean.
This is not classification, and it is not the usual flat regression either. The output of the model is a continuous function of state — the vector field of an ODE — and the training signal is a whole trajectory, not a label per sample. The relevant question is no longer "how often is it right" but "does integrating the learned field reproduce the right dynamics, including outside the training window?".
| Problem type | Output | Typical metrics |
|---|---|---|
| Classification | discrete class label | accuracy, precision, recall, F1, ROC-AUC, confusion matrix |
| Pointwise regression | continuous number per sample | MSE, MAE, R², RMSE |
| Vector-field regression (this project) | continuous function — the RHS of an ODE | train/extrapolation MSE, scale-free relative error, structural-invariant drift |
Three quantitative layers plus a qualitative phase-plane check; each answers a different question.
MSE on the held-out clean trajectory restricted to t ≤ Ttrain. A floor of about the observation-noise variance is the best an unconstrained MLP can be expected to reach.
Same MSE on the held-out window t > Ttrain — the part the network was never shown. Reported both as raw MSE and as a scale-free relative error √MSE / max|y| so amplitudes don't dominate the comparison between systems.
Mechanical energy on the pendulum and the Lotka-Volterra invariant H along the predicted orbit. The MLP is never told these exist — drift in either is a direct measure of how much physics the network rediscovered from data alone.
Phase plane with the learned vector field overlaid as a streamplot. A correct fit produces orbits that close on themselves (Lotka-Volterra) or spiral cleanly to the equilibrium (pendulum) — qualitative failures show up here long before they do in scalar metrics.
| Metric | What it measures | Threshold | Result |
|---|---|---|---|
| Pendulum · train MSE | fit on the noisy training horizon | ≲ noise² · 10 = 2.5·10⁻⁴ | 1.4·10⁻⁴ ✓ |
| Pendulum · rel-err extrap | scale-free error past Ttrain | ≤ 10 % | 5.4 % ✓ |
| Pendulum · final \|ΔE/E\| | energy drift at t = Tfull | < 5 % (ideal) · > 100 % (failure) | 137 % — see below |
| Lotka · train MSE | fit on the noisy training horizon | ≲ noise² · 10 = 9·10⁻³ | 5.6·10⁻³ ✓ |
| Lotka · rel-err extrap | scale-free error past Ttrain | ≤ 5 % | 1.6 % ✓ |
| Lotka · drift in H | structural-invariant deviation along orbit | ≤ 10 % | 7.9 % ✓ |
Three observations, in order of how informative the discrepancy is.
Relative extrapolation error of 5.4 % for the pendulum and 1.6 % for Lotka-Volterra. Closed orbits make Lotka strictly harder per period, but the amplitude is preserved across multiple cycles into the held-out window — the orbit stays bounded despite the network never being told the dynamics are conservative.
Inside the training window the predicted energy tracks the true dissipation envelope. Past it, the predicted oscillation keeps its amplitude instead of continuing to decay — the network learns the period and phase but only an approximate γ. The 137 % final-time drift is the snapshot consequence; the qualitative summary is "the envelope is right, the late-time decay rate is not".
The architectural fix is not more training — it is encoding the structure. A Hamiltonian Neural Network (Greydanus et al., 2019) cannot, by construction, drift in H. A dissipation-aware variant does the same for dE/dt. Both are listed as the natural next step in the README.
The headline numbers are what an MLP without physics priors achieves on these two systems. They establish the bar against which any inductive-bias improvement (Hamiltonian, Lagrangian, Symplectic ODE-Net) must show a measurable benefit. The follow-ups in the README close that gap by encoding invariants instead of letting the network rediscover them.