Back to the syllabus

ECMJ: Physics-Informed Neural Networks

In this notebook we will introduce the idea of physics-informed neural networks (PINNs). It is largely copy-pasted from Rackauckas's Introduction to Scientific Machine Learning through Physics-Informed Neural Networks.

Let's start by understanding what a neural network really is, why they are used, and what kinds of problems they solve, and then we will use this understanding of a neural network to see how to solve ordinary differential equations with neural networks. For there, we will use this method to regularize neural networks with physical equations, the aforementioned physics-informed neural network, and see how to define neural network architectures that satisfy physical constraints to improve the training process.

Getting Started with Machine Learning: Adding Flux

To add Flux.jl we would do:

]add Flux

To then use the package we will then use the using command:

using Flux

If you prefer to namespace all commands (like is normally done in Python, i.e. Flux.gradient instead of gradient), you can use the command:

import Flux

Note that the installation and precompilation of these packages will occur at the add and first using phases, so they may take awhile (subsequent uses will utilize the precompiled form and take a lot less time!)

What is a Neural Network?

A neural network is a function:

\[ \text{NN}(x) = W_3\sigma_2(W_2\sigma_1(W_1x + b_1) + b_2) + b_3 \]

where we can change the number of layers ((W_i,b_i)) as necessary.

Let's assume we want to approximate some $R^{10} \rightarrow R^5$ function. To do this we need to make sure that we start with 10 inputs and arrive at 5 outputs. If we want a bigger middle layer for example, we can do something like (10,32,32,5). Size changing occurs at the site of the matrix multiplication, which means that we want a 32x10 matrix, then a 32x32 matrix, and finally a 5x32 matrix. This neural network would look like:

W = [randn(32,10),randn(32,32),randn(5,32)]
b = [zeros(32),zeros(32),zeros(5)]

3-element Vector{Vector{Float64}}:
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
 [0.0, 0.0, 0.0, 0.0, 0.0]

simpleNN(x) = W[3]*tanh.(W[2]*tanh.(W[1]*x + b[1]) + b[2]) + b[3]
simpleNN(rand(10))

5-element Vector{Float64}:
 -5.92565669282987
 -4.720051167869487
  5.080383639930327
  9.692560966153614
  6.7093689764387365

This is our direct definition of a neural network. Notice that we choose to use tanh as our activation function between the layers.

Defining Neural Networks with `Flux.jl`

One of the main deep learning libraries in Julia is Flux.jl. Flux is built on top of language-wide automatic differentiation libraries, which means that one can write a program in a manner that it has easily accessible fast derivatives.

In the documentation you will find that the way a neural network is defined is through a Chain of layers. A Dense layer is the kind we defined above, which is given by an input size, an output size, and an activation function. For example, the following recreates the neural network that we had above:

using Flux
NN2 = Chain(Dense(10,32,tanh),
           Dense(32,32,tanh),
           Dense(32,5))
NN2(rand(10))

5-element Vector{Float64}:
  0.048815788822765244
 -0.4956440370228493
 -0.03316888914300169
 -0.023568531916385443
 -0.04649683999257767

Notice that Flux.jl as a library is written in pure Julia, which means that every piece of this syntax is just sugar over some Julia code that we can specialize ourselves. This is the advantage of having a language fast enough for the implementation of the library and the use of the library.

For example, the activation function is just a scalar Julia function. If we wanted to replace it by something like the quadratic function, we can just use an anonymous function to define the scalar function we would like to use:

NN3 = Chain(Dense(10,32,x->x^2),
            Dense(32,32,x->max(0,x)),
            Dense(32,5))
NN3(rand(10))

5-element Vector{Float64}:
  0.2600358055891201
  0.27756034574254107
 -0.18370021348162333
 -0.48168515473110574
 -0.10937134136174234

The second activation function there is what's known as a relu. A relu can be good to use because it's a very simple operation and satisfies a form of the UAT.

Digging into the Construction of a Neural Network Library

Again, as mentioned before, this neural network NN2 is simply a function:

simpleNN(x) = W[3]*tanh.(W[2]*tanh.(W[1]*x + b[1]) + b[2]) + b[3]

simpleNN (generic function with 1 method)

Let's dig into the library and see how that's represented. First, let's figure out where Dense comes from and what it does.

using InteractiveUtils
@which Dense(10,32,tanh)

Flux.Dense(in::Integer, out::Integer, σ; kw...) in Flux at /home/enatale/.julia/packages/Flux/js6mP/src/deprecations.jl:63

If we go to that spot of the documentation, we find something similar to this (this code is from 2020, the code changed a bit now but we stick with that version for simplicity's sake).

struct Dense{F,S<:AbstractArray,T<:AbstractArray}
  W::S
  b::T
  σ::F
end

function Dense(in::Integer, out::Integer, σ = identity; initW = glorot_uniform, initb = zeros)
  return Dense(initW(out, in), initb(out), σ)
end

function (a::Dense)(x::AbstractArray)
  W, b, σ = a.W, a.b, a.σ
  σ.(W*x .+ b)
end

First, Dense defines a struct in Julia, which holds a weight matrix W, a bias vector b, and an activation function σ. The function called Dense is what's known as an outer constructor which defines how the Dense type is built. If you give it two integers, then what it will do is take random initial W and b matrices, and then it will build the type with those matrices.

The last portion is what is known as a callable struct, or a functor. It defines the dispatch for how calls work on the struct. As a quick demonstration, let's define a type A with a field x, and then make instances of A be the function x+y:

struct A
  x
end

(a::A)(y) = a.x + y

a = A(2)
a(3)

This is similar to using an object in a way that references the self in object-oriented programming, though it's a bit more general due to allowing dispatching, i.e. this can then dependent on the input types as well.

Let's look again at that Dense call:

function (a::Dense)(x::AbstractArray)
  W, b, σ = a.W, a.b, a.σ
  σ.(W*x .+ b)
end

This means that Dense is a function that takes in an x and computes σ.(W*x.+b), which is precisely how we defined the layer before. To see that this is just a function, let's call it directly:

f = Dense(32,32,tanh)
f(rand(32))

32-element Vector{Float64}:
 -0.48098595837880276
 -0.007472243440959084
 -0.08822884963851822
 -0.2219151448476703
 -0.6521646092417378
  0.46740280654976957
  0.7343750450134997
 -0.872416850060314
 -0.11209154221598988
  0.18890372385118495
  ⋮
 -0.012825553157454669
 -0.10252390901127631
  0.4502062698982919
 -0.10404026444502046
  0.7856565999008455
 -0.4685091773009671
  0.2755381403121115
  0.0380439574197985
 -0.2687716906326908

So okay, Dense objects are just functions that have weight and bias matrices inside of them.

Now what does Chain do?

@which Chain(1,2,3)

Flux.Chain(xs...) in Flux at /home/enatale/.julia/packages/Flux/js6mP/src/layers/basic.jl:39

gives us:

struct Chain{T<:Tuple}
  layers::T
  Chain(xs...) = new{typeof(xs)}(xs)
end

applychain(::Tuple{}, x) = x
applychain(fs::Tuple, x) = applychain(tail(fs), first(fs)(x))

(c::Chain)(x) = applychain(c.layers, x)

The ... is known that the slurp operator, which allows for "slurping up" multiple arguments into a single object xs.

The function Chain(xs...) = new{typeof(xs)}(xs) is an inner constructor which builds a new instance of Chain where layers is a tuple of the inputs. This means that in our case where we put a bunch of Dense inside of there, layers is a tuple of functions.

What does Chain do?

(c::Chain)(x) = applychain(c.layers, x)

This takes the tuple of functions and then does applychain on it.

applychain(::Tuple{}, x) = x
applychain(fs::Tuple, x) = applychain(tail(fs), first(fs)(x))

applychain is a recursive function which applies the first element of the tuple onto x, then it calls applychain to call the second function onto x, repeatedly until there are no more functions in which case it returns x. This is exactly equivalent to the neural network we defined by hand.

Detail: Recursion

Why recursion? If you define a function, look at its type:

ff(x,y) = 2x+y
typeof(ff)

typeof(Main.##WeaveSandBox#291.ff) (singleton type of function ff, subtype 
of Function)

Notice that its type is simply typeof(ff) which is unique to the function, i.e. every single function is its own struct. This is exactly what a function is in Julia. A function definition lowers at the parser level to something like:

struct ff2 <: Function end
(_::ff2)(x,y) = 2x + y
const ff = ff2()

This means that the primitive operation here that everything really comes down to is calls on structs.

Why is this done with unique singleton types, i.e. types are types where every instance is equivalent? If we want the compiler to be able to optimize with respect to which function we are handling inside of another function, then we need what-function-we-are-dealing-with as compile-time information, which necessitates being type information.

Tuples are covariant and heterogeneously typed with a parameter per internal object. For example:

tup = (1.0,1,"1")
typeof(tup)

Tuple{Float64, Int64, String}

This means that it is possible to infer outputs of a tuple even if it's heterogeneously typed by making good use of constant literals. For example, the expression tup[1] will be inferred to have the output Float64. However, note that if i is not a compile-time constant, then tup[i] cannot be inferred since, given what the compiler knows, the output could be either a Float64, an Int64, or a String.

So now let's think back to our tuple of functions. By what we described before, tup = (f,g,h) is going to have a different type for each of the functions and thus could not specialize on the inputs if we used tup[i]. Therefore:

for i in 1:length(tup)
  x = tup[i](x)
end

would be slow if the function call cost is small compared to the dispatch cost of about 100ns.

How can you get around it? If everything was constant literals then this would specialize:

tup[3](tup[2](tup[1](x)))

would fully specialize and infer, and the compiler would have full knowledge of the entire call chain as if it were written out as straightline code. Now if we look at the recursion again:

applychain(::Tuple{}, x) = x
applychain(fs::Tuple, x) = applychain(tail(fs), first(fs)(x))

we see that, at compile-time, we know that typeof((f,g,h)) = Tuple{typeof(f),typeof(g),typeof(h)}, and so we know that the first first(fs) will be f, and can specialize on this. We know then that tail(fs) has to be the (g,h) and so then we recurse and know that g is first and so on. This means that this scheme is equivalent to have written out xs[3](xs[2](xs[1](x))) and is thus generating code perfectly specialized to the order and amount of functions we had put into the Chain.

This kind of abstraction, an abstraction where all of the overhead compiles away at compile time, is known as a zero-cost abstraction.

Training Neural Networks

Training a neural network means finding weights that minimize a loss function. For example, let's say we wanted to make our neural network be the constant function 1 for any input $x \in [0,1]^{10}$. We can then write the loss function:

NN = Chain(Dense(10,32,tanh),
           Dense(32,32,tanh),
           Dense(32,5))
loss() = sum(abs2,sum(abs2,NN(rand(10)).-1) for i in 1:100)
loss()

2093.259751155131

This loss function takes 100 random points in [0,1] and then computes the output of the neural network minus 1 on each of the values, and sums up the squared values (abs2).

What are the weights? Since we're using the Flux callable struct style from above, the weights are those inside of the NN chain object, which we can inspect:

NN[1].weight # The W matrix of the first layer

32×10 Matrix{Float32}:
 -0.127268    0.101832    -0.31904    …  -0.00614071  -0.0822478  -0.185687
 -0.181967   -0.165188    -0.216334      -0.203361     0.267854    0.244134
 -0.309445    0.269558     0.371136      -0.0866667   -0.365984    0.202069
  0.0656755   0.262549     0.0497751     -0.061025     0.0162307   0.234664
  0.337349    0.0983788   -0.155607      -0.360892    -0.302688    0.200906
 -0.0183754  -0.148015     0.258435   …  -0.0318864    0.0922259  -0.100319
 -0.259249   -0.172325    -0.139059      -0.275432     0.351728    0.27045
 -0.240534   -0.308785     0.0567781     -0.0817101   -0.0812175  -0.150369
  0.19152     0.129077    -0.213393       0.164026    -0.0890567  -0.345905
  0.0699575   0.139349    -0.0585049      0.252035    -0.166448   -0.287739
  ⋮                                   ⋱                           
 -0.078697    0.281806    -0.169097       0.0956756    0.128294   -0.200389
 -0.259278    0.217048    -0.0608315      0.1947      -0.0957499   0.288208
  0.10304    -0.17574      0.207632   …  -0.00962717   0.22832    -0.040411
9
 -0.348308    0.0826371    0.371522      -0.0445189    0.094233   -0.162011
 -0.291102   -0.290277    -0.32764       -0.049699     0.288599   -0.12259
 -0.351748    0.12318      0.205297      -0.239283    -0.27476    -0.335803
  0.220391   -0.239494    -0.0598302     -0.374683    -0.0159074  -0.164623
 -0.143357    0.267317    -0.0978412  …  -0.312693    -0.0343308  -0.304393
  0.0240379   0.00152527   0.26833       -0.278581    -0.0983086   0.305959

Let's grab all of the parameters together:

p = Flux.params(NN)

Params([Float32[-0.12726755 0.10183183 … -0.082247764 -0.18568745; -0.18196
733 -0.16518757 … 0.26785406 0.2441341; … ; -0.14335696 0.26731664 … -0.034
33077 -0.30439258; 0.024037888 0.0015252654 … -0.09830861 0.30595902], Floa
t32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0
, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[-0.2684807 0.27140662 … -0.1265037
8 0.007865437; -0.031246435 -0.23176686 … 0.25134623 0.2086786; … ; 0.15278
02 0.21432312 … 0.26593864 -0.22966343; 0.21472162 0.108200446 … 0.26746777
 0.2545381], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0
.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[-0.0881743 0.3487
688 … -0.30875495 -0.013353838; 0.11881395 0.30142885 … 0.13636072 -0.13467
993; … ; 0.12445466 0.14080045 … 0.18669271 0.30848077; 0.11465404 0.000891
4495 … 0.006729796 0.16443034], Float32[0.0, 0.0, 0.0, 0.0, 0.0]])

That's a helper function on Chain which recursively gathers all of the defining parameters. Let's now find the optimal values p which cause the neural network to be the constant 1 function:

Flux.train!(loss, p, Iterators.repeated((), 10000), ADAM(0.1))

Let's check the loss:

loss()

2.177021172226092e-6

This means that NN(x) is now a very good function approximator to f(x) = ones(5)!

Why neural networks in a simulation course?

What we did was find parameters that made NN(x) act like a function f(x). In any case where one is acting on data (x,y), the idea is to assume that there exists some underlying mathematical model f(x) = y. The inference problem is to then figure out what function f should be.

Why neural networks? Neural networks satisfy two properties.

The first of which is known as the Universal Approximation Theorem (UAT), which in simple non-mathematical language means that, for any ϵ of accuracy, if your neural network is large enough (has enough layers, the weight matrices are large enough), then it can approximate any (nice) function f within that ϵ. Therefore, we can reduce the problem of finding missing functions, the problem of machine learning, to a problem of finding the weights of neural networks, which is a well-defined mathematical optimization problem.

But there are many other functions with this property. Why neural networks, specifically? For example, you will have learned from analysis that $a_0 + a_1 x + a_2 x^2 + \ldots$ arbitrary polynomials (Taylor series) can be used to approximate any analytic function.

That's all for one dimension. How about two dimensional functions? It turns out it's not difficult to prove that tensor products of universal approximators will give higher dimensional universal approximators. But, in that case, we have to compute every combination of terms. This means that if we used n coefficients in each dimension d, the total number of coefficients to build a d-dimensional universal approximator from one-dimensional objects would need $n^d$ coefficients. This exponential growth is known as the curse of dimensionality.

The second property of neural networks that makes them applicable to machine learning is that they overcome the curse of dimensionality: the size of neural networks to sufficiently approximate a d-dimensional function grows as a polynomial of d, rather than exponential. This means that there's some dimensional cutoff where for $d>cutoff$ it is more efficient to use a neural network.

Neural networks have a few other properties to consider as well:

Neural networks are "easy" to compute. There's good software for them, GPU-acceleration, and all other kinds of tooling that make them particularly simple to use.
There are proofs that in many scenarios for neural networks the local minima are the global minima, meaning that local optimization is sufficient for training a neural network. Global optimization is much more expensive than local methods like gradient descent, and thus this can be a good property to abuse for faster computation.
The assumptions of the neural network can be encoded into the neural architectures. A neural network where the last layer has an activation function x->x^2 is a neural network where all outputs are positive. This means that if you want to find a positive function, you can make the optimization easier by enforcing this constraint. A lot of other constraints can be enforced, like tanh activation functions can make the neural network be a smooth (all derivatives finite) function, or other activations can cause finite numbers of learnable discontinuities.
Generating higher dimensional forms from one dimensional forms does not have good symmetry: for example, the two-dimensional tensor Fourier basis does not have a good way to represent $sin(xy)$. Neural networks, instead, are naturally not aligned to a basis. More details can be found in this talk about function approximation for multidimensional integration.

Why Differential Equations?

What we wish to do now is to use these properties of neural networks to improve the way that we investigate our scientific models.

Why do differential equations come up so often in as the model in the scientific context? Essentially, all scientific experiments always have to test how things change. For example, you take a system now, you change it, and your measurement is how the changes you made caused changes in the system. This boils down to gather information about how, for some arbitrary system $y = f(x)$, $\Delta x$ is related to $\Delta y$. Thus what you learn from scientific experiments, what is codified as scientific laws, is not "the answer", but the answer to how things change. This process of writing down equations by describing how they change precisely gives differential equations.

Solving ODEs with Neural Networks

The process of solving a differential equation with a neural network, or using a differential equation as a regularizer in the loss function, is known as a physics-informed neural network, since this allows for physical equations to guide the training of the neural network in circumstances where data might be lacking.

Background: A Method for Solving Ordinary Differential Equations with Neural Networks

The idea is to solve differential equations using neural networks by representing the solution by a neural network and training the resulting network to satisfy the conditions required by the differential equation. This is a result first due to Lagaris et. al from 1998.

Let's say we want to solve a system of ordinary differential equations

\[ u' = f(u,t) \]

with $t \in [0,1]$ and a known initial condition $u(0)=u_0$. To solve this, we approximate the solution by a neural network:

\[ NN(t) \approx u(t) \]

If $NN(t)$ was the true solution, then it would hold that $NN'(t) = f(NN(t),t)$ for all $t$. We turn this condition into our loss function:

\[ L(p) = \sum_i \left(\frac{dNN(t_i)}{dt} - f(NN(t_i),t_i) \right)^2 \]

The choice of $t_i$ could be done in many ways: it can be random, it can be a grid, etc. Anyways, when this loss function is minimized (gradients computed with standard reverse-mode automatic differentiation), then we have that $\frac{dNN(t_i)}{dt} \approx f(NN(t_i),t_i)$ and thus $NN(t)$ approximately solves the differential equation.

Note that we still have to handle the initial condition. One simple way to do this is to add an initial condition term to the cost function. This would look like:

\[ L(p) = (NN(0) - u_0)^2 + \sum_i \left(\frac{dNN(t_i)}{dt} - f(NN(t_i),t_i) \right)^2 \]

While that would work, it can be more efficient to encode the initial condition into the function itself so that it's trivially satisfied for any possible set of parameters. For example, instead of directly using a neural network, we can use:

\[ g(t) = u_0 + tNN(t) \]

as our solution. Notice that $g(t)$ is thus a universal approximator for all continuous functions such that $g(0)=u_0$. Since $g(t)$ will always satisfy the initial condition, we can train $g(t)$ to satisfy the derivative function then it will automatically be a solution to the derivative function. In this sense, we can use the loss function:

\[ L(p) = \sum_i \left(\frac{dg(t_i)}{dt} - f(g(t_i),t_i) \right)^2 \]

where $p$ are the parameters that define $g$, which in turn are the parameters which define the neural network $NN$ that define $g$. This reduces down, once again, to simply finding weights which minimize a loss function.

Coding Up the Method

Let's implement this method with Flux. Let's define a neural network to be the NN(t) above. To make the problem easier, let's look at the ODE:

\[ u' = \cos (2\pi t) \]

and approximate it with the neural network from a scalar to a scalar:

using Flux
NNODE = Chain(x -> [x], # Take in a scalar and transform it into an array
           Dense(1,32,tanh),
           Dense(32,1),
           first) # Take first value, i.e. return a scalar
NNODE(1.0)

-0.10945529715633792

Instead of directly approximating the neural network, we will use the transformed equation that is forced to satisfy the boundary conditions. Using u0=1.0, we have the function:

g(t) = t*NNODE(t) + 1f0

g (generic function with 1 method)

as our universal approximator. For this to be a function that satisfies

\[ g' = \cos (2\pi t) \]

we would need that:

using Statistics
ϵ = sqrt(eps(Float32))
loss() = mean(abs2(((g(t+ϵ)-g(t))/ϵ) - cos(2π*t)) for t in 0:1f-2:1f0)

loss (generic function with 1 method)

would be minimized.

opt = Flux.Descent(0.01)
data = Iterators.repeated((), 5000)
iter = 0
cb = function () #callback function to observe training
  global iter += 1
  if iter % 500 == 0
    display(loss())
  end
end
display(loss())
Flux.train!(loss, Flux.params(NNODE), data, opt; cb=cb)

0.5206489936435144
0.4913892508210181
0.4609371980558511
0.362373441409376
0.14930171120149932
0.031131566920083547
0.015563900652676323
0.012004493634237374
0.010685519180098907
0.010075944844511426
0.009651558436101743

How well did this do? Well if we take the integral of both sides of our differential equation, we see it's fairly trivial:

\[ \int g' = g = \int \cos 2\pi t = C + \frac{\sin 2\pi t}{2\pi} \]

where we defined $C = 1$. Let's take a bunch of (input,output) pairs from the neural network and plot it against the analytical solution to the differential equation:

using Plots
t = 0:0.001:1.0
plot(t,g.(t),label="NN")
plot!(t,1.0 .+ sin.(2π.*t)/2π, label = "True Solution")

We see that it matches very well, and we can keep improving this fit by increasing the size of the neural network, using more training points, and training for more iterations.

Harmonic-Oscillator Informed Training

Using this idea, differential equations encoding physical laws can be utilized inside of loss functions for terms which we have some basis to believe should approximately follow some physical system.

Let's investigate this last step by looking at how to inform the training of a neural network using the harmonic oscillator. Let's assume that we are taking measurements of (position,force) in some real one-dimensional spring pushing and pulling against a wall.

Instead of the simple spring, let's assume we had a more complex spring, for example, let's say $F(x) = -kx + 0.1sin(x)$ where this extra term is due to some deformities in the apparatus. By Newton's law of motion we have a second order ordinary differential equation:

\[ x'' = -kx + 0.1 \sin(x) \]

We can use the DifferentialEquations.jl package to solve this differential equation and see what this system looks like:

using DifferentialEquations
k = 1.0
force(dx,x,k,t) = -k*x + 0.1sin(x)
prob = SecondOrderODEProblem(force,1.0,0.0,(0.0,10.0),k)
sol = solve(prob)
plot(sol,label=["Velocity" "Position"])

Let's say we want to learn how to predict the force applied on the spring at each point in space, $F(x)$. Suppose that we want to learn a function but we only have 4 measurements, which includes the information about (position,velocity,force) at evenly spaced times:

t = 0:3.3:10

plot_t = 0:0.01:10
data_plot = sol(plot_t)
positions_plot = [state[2] for state in data_plot]
force_plot = [force(state[1],state[2],k,t) for state in data_plot]

# Generate the dataset
dataset = sol(t)
position_data = [state[2] for state in sol(t)]
force_data = [force(state[1],state[2],k,t) for state in sol(t)]

plot(plot_t,force_plot,xlabel="t",label="True Force")
scatter!(t,force_data,label="Force Measurements")

Can we train a neural network to approximate the expected force at any location for this spring?

Let's define a neural network to be $F(x)$ and see if we can learn the force function.

NNForce = Chain(x -> [x],
           Dense(1,32,tanh),
           Dense(32,1),
           first)

Chain(
  Main.##WeaveSandBox#291.var"#21#22"(),
  Dense(1 => 32, tanh),                 # 64 parameters
  Dense(32 => 1),                       # 33 parameters
  first,
)                   # Total: 4 arrays, 97 parameters, 644 bytes.

Now our loss function will be to match the force at the (position,force) pairs in the dataset:

loss() = sum(abs2,NNForce(position_data[i]) - force_data[i] for i in 1:length(position_data))
loss()

0.0004359553691029917

Our random parameters do not do so well, so let's train:

opt = Flux.Descent(0.01)
data = Iterators.repeated((), 5000)
iter = 0
cb = function () #callback function to observe training
  global iter += 1
  if iter % 500 == 0
    display(loss())
  end
end
display(loss())
Flux.train!(loss, Flux.params(NNForce), data, opt; cb=cb)

0.0004359553691029917
0.0003469337576801181
0.0002971988659740925
0.0002544318609716007
0.00021768289528495004
0.00018613018593760568
0.0001590582592385008
0.00013584783303288
0.00011596215718647006
9.89370057995314e-5
8.43708322703802e-5

The neural network almost exactly matched the dataset, but how well did it actually learn the real force function? Let's plot it to see:

learned_force_plot = NNForce.(positions_plot)

plot(plot_t,force_plot,xlabel="t",label="True Force")
plot!(plot_t,learned_force_plot,label="Predicted Force")
scatter!(t,force_data,label="Force Measurements")

The neural network can approximate any function, so it approximated a function that fits the data, but not the correct function. We somehow need to have more data... but where can we get more data?

We know Hooke's law, which is that the idealized spring should satisfy $F(x) = -kx$. This is a decent assumption for the evolution of the system:

force2(dx,x,k,t) = -k*x
prob_simplified = SecondOrderODEProblem(force2,1.0,0.0,(0.0,10.0),k)
sol_simplified = solve(prob_simplified)
plot(sol,label=["Velocity" "Position"])
plot!(sol_simplified,label=["Velocity Simplified" "Position Simplified"])

It definitely drifts near the end, but it should be a useful non-data assumption that we can add to improve the fitting. Assuming we know $k$, we can regularize this fitting by having a term that states our neural network should be the solution to the differential equation.

This term looks like what we had done before:

random_positions = [2rand()-1 for i in 1:100] # random values in [-1,1]
loss_ode() = sum(abs2,NNForce(x) - (-k*x) for x in random_positions)
loss_ode()

3.5066053530092005

If this term is zero, then $F(x) = -kx$, which is approximately true. So now let's put these together:

λ = 0.1
composed_loss() = loss() + λ*loss_ode()

composed_loss (generic function with 1 method)

where $λ$ is some weight factor to control the regularization against the physics assumption. Now we can train the physics-informed neural network:

opt = Flux.Descent(0.01)
data = Iterators.repeated((), 5000)
iter = 0
cb = function () #callback function to observe training
  global iter += 1
  if iter % 500 == 0
    display(composed_loss())
  end
end
display(composed_loss())
Flux.train!(composed_loss, Flux.params(NNForce), data, opt; cb=cb)

learned_force_plot = NNForce.(positions_plot)

plot(plot_t,force_plot,xlabel="t",label="True Force")
plot!(plot_t,learned_force_plot,label="Predicted Force")
scatter!(t,force_data,label="Force Measurements")

0.3507449061331904
0.00038928232614964946
0.0003679081784992187
0.0003497159407879777
0.0003339902226760945
0.00032020549790829635
0.0003079760432510769
0.00029700823390658315
0.00028707928167593256
0.00027801759174725843
0.00026968932745879493

And there we go: we have used knowledge of physics to help inform our neural network training process.

Conclusion

In this note we saw how differential equations could be solved using the function approximation technique of neural networks, solving differential equations and approximating data at the same time, allowing for physical knowledge to be embedded into the training process of a neural network. This is just one method which demonstrates how we can use scientific knowledge to improve fitting and allow for better data-driven simulations.