<h1 id="understanding-bayesian-inference">Understanding Bayesian Inference</h1>
<p><em>Jonty Sinai · 2020-04-19</em></p>
<p><em>What do we mean when we say “Bayesian inference”? More specifically, what does Bayesian inference mean for my machine learning or data modelling problem? In this blogpost I will introduce Bayesian inference and explain how it is a machine learning paradigm. More importantly I will attempt to intuitively bridge the gap between Bayesian inference as a theoretical framework and Bayesian inference as a machine learning approach.</em></p>
<h1 id="the-real-world-data-and-modelling">The Real World: Data and Modelling</h1>
<p>Let’s begin with the problem which we’re trying to solve in the first place: we have some acquired data and we would like to predict future instances. Using the same starting point as <a href="https://jontysinai.github.io/jekyll/update/2019/01/18/understanding-neural-odes.html">my post on neural ODE’s</a>, we can formalise this setting as follows:</p>
<ol>
<li>We have \(N\) observations of \(x, y\): \(\mathscr{D} = \big\{(x_1, y_1), (x_2, y_2), …, (x_N, y_N)\big\}\). \(\mathscr{X}\) is the input domain and \(\mathscr{Y}\) is the output domain.</li>
<li>Given a new input, \(x^{*}\), we would like to predict the unknown output: \(y^{*}\).</li>
<li>We assume that there is some function which maps the input and output domains: \(f:\mathscr{X} \to \mathscr{Y}\).</li>
</ol>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="90%" src="/assets/article_images/2019-01-18-understanding-neural-odes/function_x_to_y.png" /></p>
<blockquote>
<p>We model the problem as trying to find a function mapping from \(\mathscr{X}\) to \(\mathscr{Y}\).</p>
</blockquote>
<p>The machine learning approach is to choose a flexible class of functions, typically <em>neural networks</em>, which are described by parameters \(\theta\). We use a <em>learning algorithm</em>, such as stochastic gradient descent, to optimize the parameters \(\theta\) so that the function \(f\) is a good fit for the data. What we mean by a “good fit” is codified by a <em>cost</em> function which is minimized during the learning procedure. Traditionally this cost function measures how well the model fits the data: if the fit is poor, the learning algorithm adjusts the parameters of the model and measures again, until the cost is sufficiently low.</p>
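<p>As a toy sketch of this loop (with made-up linear data and a deliberately tiny one-parameter model, rather than a neural network), plain gradient descent on a mean-squared-error cost looks like this:</p>

```python
import numpy as np

# Made-up linear data: y is roughly 2x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + 0.1 * rng.normal(size=50)

theta = 0.0      # single parameter of the model f(x) = theta * x
lr = 0.1         # learning rate

for _ in range(500):
    y_hat = theta * x
    grad = 2.0 * np.mean((y_hat - y) * x)   # gradient of the mean squared error
    theta = theta - lr * grad               # adjust the parameter, then re-measure

# theta should now be close to the true slope of 2
```

<p>The loop repeats the “measure the cost, adjust \(\theta\)” cycle until it settles on a single optimal value \(\theta^{*}\) — which is exactly the determinism discussed next.</p>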
<p><img style="display:block; margin-left:auto; margin-right:auto" width="90%" src="/assets/article_images/2020-04-19-understanding-bayesian-inference/function_x_to_y_with_theta.png" /></p>
<p>We then make new predictions using the optimal parameter \(\theta^{*}\). One shortcoming of this approach is that once we have found the optimal \(\theta^{*}\), the prediction process is completely deterministic. This is a problem if we require uncertainty in our predictions:</p>
<ul>
<li>How certain are we that patient \(\mathscr{x}\) will have outcome \(\mathscr{y}\)?</li>
<li>How certain are we that image \(\mathscr{x}\) can be classified as \(\mathscr{y}\)?</li>
<li>How certain are we that measurement \(\mathscr{x}\) will yield signal \(\mathscr{y}\)?</li>
</ul>
<p>In general uncertainty is desirable if the variance in the data is large. The learning algorithm finds \(\theta^{*}\) which works for most of the data points which we’ve seen so far. How reliable will \(\theta^{*}\) be, when \(\mathscr{x}\) is an outlier - when it is not like most datapoints?</p>
<p>Uncertainty is also desirable when we don’t have a large number of samples. In this case, if the variance is large between each sample, then we may not have enough evidence to support any one particular value of \(\theta\). Instead we would like a way of making predictions which accounts for our uncertainty on the value of the parameter \(\theta\) itself.</p>
<blockquote>
<p>This is precisely what we will try to achieve with Bayesian inference.</p>
</blockquote>
<h1 id="an-old-favourite-bayes-theorem">An old favourite: Bayes’ Theorem</h1>
<p>As the name implies, Bayesian inference has a lot to do with <em>Bayes’ Theorem</em>. In fact, Bayes’ Theorem will form the bridge between the uncertainty in our data and the uncertainty in our predictions. You may already be familiar with Bayes’ Theorem (which I also covered in my post on <a href="https://jontysinai.github.io/jekyll/update/2018/12/23/probability-part-two-conditional-probability.html">conditional probability</a>), but as a recap here is what we mean by Bayes’ Theorem:</p>
<blockquote>
<p><em>Let \(A\) and \(B\) be random variables\(^{*}\) with probability measure \(\mathbb{P}\), such that \(\mathbb{P}(A) \neq 0\), then</em>
\[
\mathbb{P}(B|A) = \frac{ \mathbb{P}(A | B)\mathbb{P}(B) }{ \mathbb{P}(A) }.
\]</p>
</blockquote>
<blockquote>
<ul>
<li>\(\mathbb{P}(B)\) is known as the <strong>prior probability</strong> of \(B\) - it is what we know (assume) about \(B\) before we consider the effects of \(A\).</li>
<li>\(\mathbb{P}(A)\) is known as the <strong>evidence</strong> of \(A\) - it is everything we know about \(A\).</li>
<li>\(\mathbb{P}(A | B)\) is known as the <strong>likelihood</strong> of \(A\), given \(B\) - it is an estimate of how likely the \(A\) that we observed is, given possible values of \(B\).</li>
<li>\(\mathbb{P}(B | A)\) is known as the <strong>posterior probability</strong> of \(B\), given \(A\) - it is the <em>conditional probability</em> of \(B\) <em>after</em> we have obtained evidence of \(A\).</li>
</ul>
</blockquote>
<p>Bayes’ Theorem is useful when it is difficult to calculate the posterior of \(B\) directly. We use Bayes’ Theorem when we have obtained data on \(A\) and we can measure the likelihood and prior with greater ease.</p>
<p>* <small> for more detail on random variables and probability measures, see my <a href="https://jontysinai.github.io/jekyll/update/2017/11/23/probability-for-everyone.html">introductory post on probability</a>.
</small></p>
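<p>As a quick worked example of the theorem, with hypothetical numbers: suppose \(B\) is the event that a patient has a condition with 1% prevalence, and \(A\) is a positive result from a test with a 95% true positive rate and a 5% false positive rate.</p>

```python
# Hypothetical numbers: B = "has the condition", A = "tests positive"
p_B = 0.01                 # prior P(B): 1% prevalence
p_A_given_B = 0.95         # likelihood P(A|B): true positive rate
p_A_given_not_B = 0.05     # false positive rate P(A|not B)

# Evidence P(A), via the law of total probability
p_A = p_A_given_B * p_B + p_A_given_not_B * (1 - p_B)

# Posterior P(B|A), via Bayes' Theorem
p_B_given_A = p_A_given_B * p_B / p_A
```

<p>The posterior comes out to roughly 0.16: even after a positive test, the condition is still fairly unlikely, because the prior is so small. This is the sense in which Bayes’ Theorem lets us compute a hard-to-reach posterior from an easy-to-measure likelihood and prior.</p>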
<h1 id="bayes-theorem-meets-the-real-world">Bayes’ Theorem Meets the Real World</h1>
<p>Let’s go back to the fundamental modelling problem. We have a parametric function, described by \(\theta\), and we want to choose the right \(\theta\) so that our function describes the data well. We also want to compute our uncertainty over \(\theta\) so that we can make predictions with confidence, and know when this uncertainty is high.</p>
<blockquote>
<p>But what does it mean to include uncertainty into our modelling of \(f\) and \(\theta\)?</p>
</blockquote>
<p>This means precisely to use probabilities - the mathematical language of uncertainty - and in particular it means to use <em>probability distributions</em>. Instead of optimizing for a single value of \(\theta\), <em>we can treat \(\theta\) as a <strong>random variable</strong> which has <strong>highest probability</strong> at the optimal value</em>.</p>
<blockquote>
<p>But how do we ensure that the probability distribution over \(\theta\) has highest probability at the optimal value?</p>
</blockquote>
<p>We do this by using Bayes’ Theorem. In particular we can start with a best guess at what this distribution may be and call it our <strong>prior distribution</strong>:</p>
<p>\[
P(\; \theta \;)^{*}.
\]</p>
<blockquote>
<p>How do we know what a good prior distribution is?</p>
</blockquote>
<p>This is a difficult question, but in <a href="https://en.wikipedia.org/wiki/Bernstein–von_Mises_theorem">theory</a> our choice of prior doesn’t matter if we are able to obtain enough data. In practice the choice of prior will affect the posterior, and it helps to encode (and test!) our assumptions on \(\theta\). For example, use a discrete probability distribution if \(\theta\) is discrete, or a positive distribution if \(\theta\) must be non-negative.</p>
<p>What’s important in Bayesian inference is that whatever prior we choose, the additional data which we obtain will alter the shape of the prior to inform our target posterior distribution. Going back to Bayes’ Theorem, we can incorporate the data \(\mathscr{D}\) into the formula using the <strong>likelihood</strong></p>
<p>\[
P(\; \mathscr{D} \;|\; \theta \;).
\]</p>
<p>It is through the likelihood that we encode the optimality of \(\theta\) into the posterior, which is defined by the degree to which \(\theta\) is a good fit for the data. Strictly speaking the likelihood is a function of \(\theta\) - we assume the data has been observed and is fixed. With this point of view the likelihood can be interpreted as the <em>likelihood of the observed data occurring, given the possible values of \(\theta\)</em>. A higher likelihood indicates a better value of \(\theta\).</p>
<blockquote>
<ul>
<li>In fact many machine learning loss functions are derived from the likelihood, usually in the form of the negative log-likelihood (applying the log transformation to the likelihood helps for numerical stability purposes, especially over a large dataset).</li>
<li>The likelihood function does not need to strictly be a probability distribution.</li>
</ul>
</blockquote>
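<p>To illustrate the first point, here is a minimal sketch of a negative log-likelihood, assuming Gaussian observation noise (a common, but by no means the only, choice):</p>

```python
import numpy as np

def gaussian_nll(y, y_hat, sigma=1.0):
    """Negative log-likelihood of observations y under predictions y_hat,
    assuming independent Gaussian noise with scale sigma."""
    n = len(y)
    return (n / 2.0) * np.log(2.0 * np.pi * sigma**2) \
        + np.sum((y - y_hat) ** 2) / (2.0 * sigma**2)
```

<p>With \(\sigma\) fixed, the first term is a constant, so minimising this negative log-likelihood over predictions is equivalent to minimising the squared error - one concrete way in which a familiar loss function is derived from the likelihood.</p>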
<p>Now we can use Bayes’ Theorem to write down a formula for the <strong>posterior distribution</strong> of \(\theta\):</p>
<p>\[
P(\; \theta \;|\; \mathscr{D} \;) = \frac{ P(\; \mathscr{D} \;|\; \theta \;)P(\; \theta \;) }{ P(\; \mathscr{D} \;)}.
\]</p>
<p>The posterior distribution takes into account our uncertainty over \(\theta\) (incorporated in the prior distribution), adjusted by the actual data that we observed (incorporated in the likelihood, and also the evidence of the data). The combination of the likelihood with the prior makes the posterior distribution suitable for estimating optimal values of \(\theta\) with the added measure of uncertainty.</p>
<blockquote>
<p>How does this posterior distribution help us?</p>
</blockquote>
<p>Instead of calculating a single estimate of the optimal \(\theta\) we look for a (posterior) distribution on \(\theta\). From this distribution we can calculate a sample mean for \(\theta\) and report our confidence over a credible interval - say the middle 95% of the distribution.</p>
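<p>As a concrete sketch with made-up data: for a coin-flip model the posterior over \(\theta = P(\text{heads})\) can actually be computed in closed form (a Beta distribution, thanks to conjugacy - a luxury most models don’t have), and we can read off a sample mean and a middle-95% credible interval from posterior samples:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 7 heads observed in 10 coin flips
heads, flips = 7, 10

# Uniform Beta(1, 1) prior + Bernoulli likelihood -> Beta posterior
a_post = 1 + heads
b_post = 1 + (flips - heads)

samples = rng.beta(a_post, b_post, size=100_000)
theta_mean = samples.mean()                    # sample mean for theta
lo, hi = np.percentile(samples, [2.5, 97.5])   # middle 95% credible interval
```

<p>The posterior mean lands at \(8/12 \approx 0.67\), pulled slightly towards the uniform prior from the raw frequency of \(0.7\), and the credible interval reports how uncertain we remain after only ten flips.</p>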
<p><img style="display:block; margin-left:auto; margin-right:auto" width="90%" src="/assets/article_images/2020-04-19-understanding-bayesian-inference/posterior_over_theta.png" /></p>
<p>Going back to the original function estimation problem, this means that when we calculate predictions for \(\mathscr{y}\), we first estimate \(\theta\) from the posterior so that our predictions become conditional on \(\theta\):</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="90%" src="/assets/article_images/2020-04-19-understanding-bayesian-inference/posterior_estimation_with_theta.png" /></p>
<p>But it turns out that calculating this posterior can be <strong>hard</strong> with a capital <em>H</em>, for two reasons:</p>
<ol>
<li>The likelihood function is often highly nonlinear and must include every sample from the training dataset. In addition the complexity of the likelihood increases with the dimension of the data.</li>
<li>The evidence, \(P(\; \mathscr{D} \;)\), can be extremely hard to compute. It requires knowing how much probability to assign to every possible value which the data can take. In an uncertain world this is practically impossible.</li>
</ol>
<p>The algorithmic insight is to use what’s called <strong>approximate inference</strong>. Instead of calculating the posterior directly, we will try to approximate it.</p>
<p><small>*<strong>Note:</strong> that whereas before I used blackboard bold \(\mathbb{P}\) to denote the <em>probability measure</em> for individual probabilities, here I am using plain \(P\) to denote a <strong>probability distribution function</strong>. These are subtly different things conceptually and the change in notation is a pedantic way of expressing that. In short the <em>probability distribution function</em> is a function which describes all of the probabilities of a random variable over its domain.
</small></p>
<h1 id="approximate-inference">Approximate Inference</h1>
<p>Broadly speaking there are three main types of approximate inference algorithms, with the first two being the most popular. Briefly, they are:</p>
<ul>
<li>
<p><strong>Monte Carlo Sampling:</strong> this is a broad class of algorithms which approximates a target distribution by computing samples. In Bayesian inference we use a special type of Monte Carlo sampling algorithm known as <strong>Markov Chain Monte Carlo (MCMC)</strong> which iteratively computes samples using only information from the previous sample and the target distribution we are trying to approximate:
\[
\theta_{t} \leftarrow \theta_{t-1}, \ \theta_{t} \sim P(\; \theta \;|\; \mathscr{D} \;).
\]
With Monte Carlo algorithms, we do not try to explicitly compute the distribution, but rather try to compute candidate samples for \(\theta\) until eventually most samples are close to the optimal value of \(\theta\).</p>
</li>
<li>
<p><strong>Variational\(^{*}\) Inference:</strong> here we attempt to approximate the posterior distribution itself. We do this in much the same way that we use machine learning to approximate a function: choose a flexible function \(q\) parametrised by \(\phi\) and optimise \(\phi\) so that
\[
q(\theta) \approx P(\; \theta \;|\; \mathscr{D} \;).
\]</p>
</li>
<li>
<p><strong>Expectation Propagation:</strong> again we approximate the posterior with a tractable function \(q\), except we specifically choose this function to be factorisable into a product of distributions which are conditional on each other. We then use <em>message passing algorithms</em> to iteratively update the factors of \(q\) until it is a good fit for the posterior.</p>
</li>
</ul>
<p>While I won’t go into more detail on each of these algorithms - they are entire blogposts (and more!) unto themselves - I will try and unravel the basic fundamentals of approximate inference for the remainder of this blogpost.</p>
<p><small>\(^{*}\)The name variational inference comes from <em>variational calculus</em> which is the calculus of finding functions by varying their parameters. The function \(q\) is sometimes known as the <em>guide</em>.
</small></p>
<h4 id="a-chicken-and-egg-problem">A Chicken and Egg problem</h4>
<p>You may have noticed that approximate inference has a chicken and egg problem. We want to approximate the posterior by drawing samples, but how can we draw samples from the posterior if we can’t compute it directly? Or we want to approximate the posterior with some parametrised guide distribution, but how do we know that it is a good match for the posterior without evaluating the posterior directly?</p>
<p>We get around this as follows:</p>
<ol>
<li>The denominator in Bayes’ Theorem is a <em>normalization constant</em>. If we ignore this constant (a constant function really), then the posterior distribution is <em>proportional</em> to the product of the likelihood and the prior (both of which are computable to us), which we write as follows:
\[
P(\; \theta \;|\; \mathscr{D} \;) \propto P(\; \mathscr{D} \;|\; \theta \;)P(\; \theta \;).
\]</li>
<li>Furthermore in approximate inference we are more interested in the <em>argmax</em> of the posterior distribution - which \(\theta\) has maximum probability? - rather than the probability on \(\theta\) itself. Thus we can exploit the property of optimization where if \(g\) is our objective function and \(C\) is a positive constant then:
\[
\mathrm{argmax}\{\; g \;\} = \mathrm{argmax}\Big\{\; \frac{1}{C}g \;\Big\}.
\]
<blockquote>
<p>In the case of MCMC specifically, we use an algorithm which explicitly does not require knowing \(P(\; \mathscr{D} \;)\) in order to update samples of \(\theta\), doing so until we start sampling the \(\theta\) with maximum probability.</p>
</blockquote>
</li>
<li>Once we have the posterior up to a normalization constant, we simply need to normalize the posterior to turn it into a probability distribution.</li>
</ol>
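<p>A minimal sketch of this idea is the Metropolis algorithm, an MCMC method that needs only the <em>unnormalised</em> posterior - the likelihood times the prior - never \(P(\; \mathscr{D} \;)\). The model, prior and data below are all hypothetical (a Gaussian likelihood with a wide Gaussian prior on the unknown mean):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, size=100)   # observations with an unknown mean theta

def log_unnorm_posterior(theta):
    # log of (likelihood * prior), with the evidence P(D) dropped:
    # Gaussian likelihood (noise sd = 1) and a wide N(0, 10^2) prior
    log_lik = -0.5 * np.sum((data - theta) ** 2)
    log_prior = -0.5 * (theta / 10.0) ** 2
    return log_lik + log_prior

theta = 0.0                  # an initial starting point for theta
samples = []
for _ in range(5000):
    proposal = theta + 0.3 * rng.normal()       # propose from the previous sample
    log_ratio = log_unnorm_posterior(proposal) - log_unnorm_posterior(theta)
    if np.log(rng.uniform()) < log_ratio:       # accept with prob min(1, ratio)
        theta = proposal
    samples.append(theta)

posterior_mean = np.mean(samples[1000:])        # discard early "burn-in" samples
```

<p>Notice that the accept/reject ratio only ever compares two unnormalised posterior values, so the intractable normalization constant cancels out - exactly the trick described above.</p>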
<h2 id="unravelling-the-graph">Unravelling the graph</h2>
<p>We can express the cyclic nature of Bayesian inference graphically in the following way:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="90%" src="/assets/article_images/2020-04-19-understanding-bayesian-inference/bayesian_inference_graph_cyclic.png" /></p>
<blockquote>
<ul>
<li>Start with a prior distribution on \(\theta\), which we can use to sample \(\theta\).</li>
<li>What is the likelihood of \(\theta\), given the data?</li>
<li>We get the posterior probability of \(\theta\), which we can use to sample a better estimate of \(\theta\).</li>
<li>But we can still test the likelihood of this \(\theta\)…</li>
</ul>
</blockquote>
<p>Now we can see what we are trying to do with Bayesian inference a bit better. Start with a prior and likelihood (going forwards on the graph), and then go backwards to calculate a suitable posterior. When we can’t calculate this posterior exactly (feasible for only the simplest of models), then we resort to approximate inference. Approximate inference works by iteratively updating the posterior (forwards) and then checking to see if the samples for \(\theta\) are good (backwards). In general approximate inference algorithms work as follows:</p>
<p><em>Given a prior distribution and likelihood function:</em></p>
<ol>
<li>Choose an initial starting point for \(\theta\).</li>
<li>Use the likelihood function and prior to estimate a posterior (or posterior sample) from the current value of \(\theta\).</li>
<li>Suitably update the posterior based on how good (or bad) a fit we have so far.</li>
<li>Sample a new \(\theta\) value from the current posterior.</li>
<li>Repeat steps 2-4 until we get a suitable fit for the posterior.</li>
</ol>
<blockquote>
<p><strong>Note:</strong> For variational inference we instead update the parameter \(\phi\) to obtain a good functional approximation of the posterior distribution, and from this approximation we will sample \(\theta\).</p>
</blockquote>
<p>Going back to the graph, we are essentially unrolling the cyclic relationship between the posterior estimate and the sampling of \(\theta\) into a successive sequence of algorithm iterations:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="100%" src="/assets/article_images/2020-04-19-understanding-bayesian-inference/approximate_inference_graph.png" /></p>
<blockquote>
<p>Start with an initial value for \(\theta\). Successively update the posterior and then sample a new estimate of \(\theta\).</p>
</blockquote>
<p>The exact details of steps 2-4 and when we determine to terminate the algorithm will depend on the choice of algorithm used. Although these algorithms are beyond the scope of this introduction to approximate inference, I intend on covering them in future posts.</p>
<h1 id="tying-it-back-together">Tying it back together</h1>
<p>What we have seen so far is that when we want to express uncertainty over the modelling parameters in a machine learning problem, we can use Bayes’ Theorem to help us. Specifically:</p>
<ol>
<li>When we have a dataset \(\mathscr{D} = \big\{(x_1, y_1), (x_2, y_2), …, (x_N, y_N)\big\}\),</li>
<li>And a model, \(f: \mathscr{X} \to \mathscr{Y}\), with parameters \(\theta\),</li>
<li>Then we can use Bayes’ Theorem to describe a <strong>posterior distribution</strong> over the modelling parameters \(\theta\), by specifying a <em>prior distribution</em> and a <em>likelihood function</em> so that:
\[
P(\; \theta \;|\; \mathscr{D} \;) \propto P(\; \mathscr{D} \;|\; \theta \;)P(\; \theta \;).
\]</li>
</ol>
<p>Once we have this posterior distribution, we can make a prediction \(y^{*}\) at a new point \(x^{*}\), using the so-called <em>predictive distribution</em> - the probability distribution over \(y^{*}\) given \(x^{*}\) <em>and</em> our <strong>uncertainty over the parameters \(\theta\)</strong>:</p>
<p>\[
P(\; y^{*} \;| x^{*}; \mathscr{D} \;) = \int_{\theta} P(\; y^{*} \;| x^{*}, \theta \;)P(\; \theta \;|\; \mathscr{D} \;)d\theta.
\]</p>
<p>In practice we approximate \(P(\; y^{*} \;|\; x^{*}, \theta \;)\) by the function \(f\) and we can use the posterior distribution to compute samples \(\big\{ \theta_1, \theta_2, … \theta_M \big\}\). Then we can calculate a <em>sample mean</em> for \(y^{*}\) using a <em>Monte Carlo</em> estimate:</p>
<p>\[
y^{*} \approx \; \widehat{y} \; = \; \frac{1}{M}\sum_{s=1}^{M}f(x^{*}; \theta_s).
\]</p>
<p>To measure the uncertainty of our prediction we can calculate a <em>sample variance</em>:</p>
<p>\[
Var(y^{*}) \approx \frac{1}{M}\sum_{s=1}^{M}\big( f(x^{*}; \theta_s) - \widehat{y} \big)^{2}.
\]</p>
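<p>These two estimates can be sketched directly in code. The posterior samples below are stand-ins drawn from a made-up distribution, and \(f\) is a hypothetical one-parameter model - in practice the samples would come from MCMC or a variational approximation of \(P(\; \theta \;|\; \mathscr{D} \;)\):</p>

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x, theta):
    # Hypothetical fitted model: a one-parameter linear predictor
    return theta * x

# Stand-in posterior samples of theta (made-up; normally produced by
# an approximate-inference algorithm)
theta_samples = rng.normal(2.0, 0.1, size=1000)

x_star = 3.0
preds = f(x_star, theta_samples)    # one prediction per posterior sample

y_hat = preds.mean()                # Monte Carlo estimate of y*
y_var = preds.var()                 # sample variance: predictive uncertainty
```

<p>The spread of the posterior samples flows directly into the spread of the predictions, which is exactly how parameter uncertainty becomes predictive uncertainty.</p>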
<h3 id="bayesian-inference-is-a-modelling-paradigm">Bayesian Inference is a Modelling Paradigm</h3>
<p>In traditional machine learning we specify a model and try and find the parameters of the model which best fit the data. The cost function which we use, typically the <em>likelihood</em>, gives us a measure of <em>how well the parameters fit the data</em>.</p>
<p>Bayesian inference instead seeks to find the <em>most probable</em> parameters, given the data. By seeking <em>most probable</em>, we imbue the model with predictive uncertainty. In Bayesian inference we model a <em>posterior distribution</em> over the parameters.</p>
<h3 id="bayesian-inference-is-an-optimization-paradigm">Bayesian Inference is an Optimization Paradigm</h3>
<p>In the traditional machine learning approach we can only find a single point estimate of the optimal parameter \(\theta^{*}\). This is known as a <em>maximum likelihood estimate (MLE)</em> and can be interpreted as follows: <em>if true, we would likely observe the data which we have under this model</em>.</p>
<p>In the Bayesian inference approach, we use the likelihood to reshape the prior distribution into a posterior distribution which places <em>highest probability over the optimal \(\theta\)</em>. This \(\theta\) is sometimes known as the <em>maximum a posteriori estimate (MAP)</em>, and can be interpreted as the <em>parameter which is a best fit for the data, given our uncertainty</em>.</p>
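<p>As a small numeric contrast between the two estimates, here is a hypothetical coin-flip model again, this time with a (made-up) Beta(2, 2) prior, whose posterior mode gives the MAP estimate in closed form:</p>

```python
# Hypothetical coin-flip data: 7 heads in 10 flips; theta = P(heads)
heads, flips = 7, 10

theta_mle = heads / flips       # maximum likelihood estimate: 0.7

# With a Beta(2, 2) prior the posterior is Beta(2 + heads, 2 + tails),
# and the MAP estimate is the mode of that posterior
a_post, b_post = 2 + heads, 2 + (flips - heads)
theta_map = (a_post - 1) / (a_post + b_post - 2)
```

<p>The MAP estimate comes out at \(8/12 \approx 0.67\): the prior pulls it from the MLE of \(0.7\) towards the prior mean of \(0.5\), reflecting our uncertainty after only ten flips.</p>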
<blockquote>
<p>Take note that when using the predictive distribution above we are not explicitly using the MAP estimate for \(\theta\) but rather taking many samples of \(\theta\). Of course if our uncertainty over \(\theta\) is not too large, then most of the \(\theta\) samples will be close to the MAP estimate.</p>
</blockquote>
<p>In the algorithmic approach, our aim is to optimize \(\theta\) based on the data. Except now we can optimize \(\theta\) according to the posterior distribution.</p>
<h3 id="bayesian-inference-is-a-machine-learning-paradigm">Bayesian Inference is a Machine Learning Paradigm</h3>
<p>In modern machine learning, Bayesian inference gives us a framework by which we can resolve the desire for an accurate algorithm with a measure of uncertainty. Using approximate inference we can scale Bayesian inference to large datasets and models with high dimensional parameters, such as neural networks.</p>
<p>Really this whole post has been about how we can rethink machine learning using Bayes’ Theorem, and that, ultimately, is what Bayesian inference is.</p>
<h1 id="bonus-a-little-bit-of-information-theory">Bonus: A Little Bit of Information Theory</h1>
<p>In a way we can think of Bayesian inference (and machine learning in general) as an exercise in compression. We don’t have access to all the data in the world for our problem. Instead we use the data which we have to optimize the parameters \(\theta\) of a function which we hope can describe the relationship between \(\mathscr{X}\) and \(\mathscr{Y}\). In general the dimensionality of \(\theta\) is much lower than the data, so really we are <em>compressing</em> all the information in our dataset into the parameter \(\theta\).</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="90%" src="/assets/article_images/2020-04-19-understanding-bayesian-inference/compression_with_theta.png" /></p>
<p>When we go in the forward direction of Bayesian inference, we are finding the \(\theta\) which most <em>likely</em> describes the data. In other words, the <em>likelihood</em> helps us to <strong>encode</strong> the data into \(\theta\). When we go in the backward direction (posterior inference), we sample many values of \(\theta\) to account for our uncertainty. If our posterior distribution is good, then we will almost always sample the optimal value of \(\theta\), such that we should be able to recover the original dataset. In other words, the posterior helps us to <strong>decode</strong> the input to our target values.</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="90%" src="/assets/article_images/2020-04-19-understanding-bayesian-inference/compression_with_posterior.png" /></p>
<hr />
<p><em>Photo by Dattatreya Patra on Unsplash</em></p>
<h1 id="understanding-neural-odes">Understanding Neural ODE’s</h1>
<p><em>Jonty Sinai · 2019-01-18</em></p>
<p><em>In this blogpost I explore how ODE’s can be used to solve data modelling problems. I take a deep dive into the data modelling problem at hand and present ODE’s (which model rates of change) as an alternative to regression (which attempts to model data directly). Later I introduce the extension to neural ODE’s. To keep the focus on neural ODE’s I’ll assume that you have knowledge of linear regression, deep learning and basic calculus.</em></p>
<p><em>This is a long post and I hope that you will take the time to read it in its entirety. I’ve split the post into 5 numbered parts, which I summarise at the end as 5 conceptual steps (leaps). If at times you feel that you are losing track of the bigger picture, then feel free to scroll to the end to see where each idea fits.</em></p>
<p>Many of you may have recently come across the concept of “Neural Ordinary Differential Equations”, or just “Neural ODE’s” for short. Based on a <a href="https://arxiv.org/abs/1806.07366">2018 paper</a> by Ricky Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt and David Duvenaud from the University of Toronto, neural ODE’s became prominent after being named one of the <a href="https://nips.cc/Conferences/2018/Awards">best student papers</a> at NeurIPS 2018 in Montreal. Shortly afterwards a media feature in the <a href="https://www.technologyreview.com/s/612561/a-radical-new-neural-network-design-could-overcome-big-challenges-in-ai/">MIT Tech Review</a> and a front page appearance on <a href="https://news.ycombinator.com/item?id=18676986">Hacker News</a> helped propel neural ODE’s into the machine learning limelight.</p>
<blockquote>
<p>MIT Tech Review described the architecture as a “radical new design” with the “potential to shake up the field—in the same way that Ian Goodfellow did when he published his paper on GANs.”</p>
</blockquote>
<p>This may sound like hype, and perhaps it is, however I’m intrigued and excited by neural ODE’s for several reasons. At first glance they appear to have immediate practical advantages (this is where I believe the hype differs from GANs), for example in continuous time settings. Secondly, ODE’s have a long history in applied and pure mathematics. They are well studied in physics, engineering and other sciences. Most modern scientific computing environments have extremely well-tested and high-performing differential equation libraries.</p>
<p>Finally neural ODE’s bring a powerful modelling tool out of the woodwork. When I was an undergrad in mathematics, we were taught that we solve applied maths problems using differential equations. Today we are taught that we solve applied maths problems using machine learning. I got into machine learning because I found neural networks to be the most promising tool for solving problems with maths. Neural ODE’s open up a different arena for solving problems using the muscle power of neural networks.</p>
<blockquote>
<p>In a word, they are indeed a “radical” new paradigm in machine learning.</p>
</blockquote>
<p>In this blogpost I explore this new paradigm, starting with the initial data modelling problem. I’ll introduce ODE’s as an alternative approach to regression and explain why they may hold an advantage. I’ll give a brief perspective of the world of numerical ODE solvers. After that I’ll introduce neural ODE’s as they are described in the paper. I’ll briefly explain how backpropagation is implemented, which is the major contribution of the paper. Finally I will bring everything together.</p>
<h1 id="1-modelling-data">1. Modelling Data</h1>
<p><br /></p>
<p>In a classical data modelling setting we have a set of \(N\) pairs of data points, \(\mathscr{D} = \big\{(x_1, y_1), (x_2, y_2), …, (x_N, y_N)\big\}\). \(\mathscr{X}\) is the <em>input</em> domain and \(\mathscr{Y}\) is the <em>output</em> domain. Given a new data point, \(x^{*}\), we would like to make a <em>prediction</em> about its value \(y^{*}\). We can view this problem as:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/nature_black_box.png" /></p>
<blockquote>
<p><strong>Nature generates the data:</strong> we can try and describe the data using an algorithm and then use the resulting model to make new predictions. For more on this approach, see the seminal essay on data modelling, <a href="https://projecteuclid.org/euclid.ss/1009213726"><strong>Statistical Modeling: The Two Cultures</strong></a>, by the late Leo Breiman.</p>
</blockquote>
<p>Our original dataset is generated by nature (physical, social, economic or otherwise). We have no sophisticated way of reliably modelling the data generation process itself, so instead we treat nature as a black box and bypass it with an algorithm. The machine learning approach is to iteratively find a function which best describes the data. The process of finding this function is known as a <em>learning algorithm</em>. In short, machine learning can be thought of as repurposing the original data modelling problem into a <em>function approximation problem</em>. In modern machine learning, particularly deep learning, these functions are highly flexible <strong>neural networks</strong>. Thus the original data modelling problem becomes something like this:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/nn_bypassing_nature.png" /></p>
<blockquote>
<p><strong>Neural networks</strong> can be viewed as an algorithmic approximation which bypasses the generative process of the data. They are trained over many iterations of an optimisation loop.</p>
</blockquote>
<p>Part of the success of machine learning (which I will use interchangeably with deep learning) lies in the enormous flexibility of neural networks. They are high-dimensional, have millions of parameters over which we can compress patterns found in data, are highly nonlinear and can be implemented using highly optimised computing frameworks.</p>
<p>However, in order to understand where neural ODE’s fit in, it will be useful to abstract away from neural networks and return to the original function approximation perspective.</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/function_x_to_y.png" /></p>
<blockquote>
<p><strong>Our ultimate goal</strong> will be to find a robust mapping from \(\mathscr{X}\) to \(\mathscr{Y}\).</p>
</blockquote>
<p>We are now at the fundamental mathematical problem where there is some function, which we would like to know, which sends points in \(\mathscr{X}\) to points in \(\mathscr{Y}\):</p>
<p>\[
f: \mathscr{X} \to \mathscr{Y}. <br />
\]</p>
<h3 id="two-basic-approches-odes-vs-regression">Two Basic Approaches: ODE’s vs Regression</h3>
<p><br /></p>
<p>Given the dataset, \(\mathscr{D}\), there are two basic approaches for solving this problem. The first approach, known as <strong>regression</strong>, should be familiar to anyone working in machine learning. The other approach, which is the <em>dark horse</em> here, is to use an <strong>ordinary differential equation</strong>.</p>
<p>To explain these two approaches, let’s suppose that \(\mathscr{X}\) and \(\mathscr{Y}\) are both just the ordinary real number line, \(\mathbb{R}\), so that the problem can be visualised and reasoned about easily, and that we have the following arbitrary data points:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/toy_data_in_r2.png" /></p>
<blockquote>
<p><strong>Linear data</strong> in \(\mathbb{R}^2\).</p>
</blockquote>
<p>How should we go about finding the function</p>
<p>\[
f: \mathbb{R} \to \mathbb{R}. <br />
\]</p>
<p>which best describes the data?</p>
<h3 id="curve-fitting">Curve Fitting</h3>
<p><br /></p>
<p>I’ve kept this situation simple and low dimensional so that we can use ordinary <em>linear regression</em>. The data has a somewhat linear shape so we describe it using the <em>parametric form</em> of a line:</p>
<p>\[
\widehat{y} = ax + b,
\]</p>
<p>where \(y_i = \widehat y_i + \epsilon\), and \(\epsilon\) is an error term added to our model, representing random noise in the data. In linear regression we try to minimise the <em>mean square error</em></p>
<p>\[
\mathscr{L}(a, b) = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \widehat{y}_i\big)^2.
\]</p>
<p>The mean square error is a <strong>loss function</strong> which is a function of the <strong>parameters</strong> \(a\) and \(b\) - since we take our data as fixed and each \(\widehat y_i\) is evaluated at known \(x_i\) for unknown \(a\), \(b\). In machine learning we find the <em>optimal choice</em> of the parameters which <em>minimise</em> the loss function. Call them \(a^*\) and \(b^*\). The function which we’ve approximated is then the line</p>
<p>\[
f^*(x) = a^*x + b^*,
\]</p>
<p>which will look something like this:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/curve_fitting.png" /></p>
<blockquote>
<p><strong>The line of best fit</strong> produced by linear regression.</p>
</blockquote>
<p>In general this is known as curve fitting.</p>
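As a concrete sketch of this procedure, here is ordinary least squares on hypothetical noisy linear data; the dataset, and its true parameters \(a = 2\), \(b = 1\), are invented for illustration:

```python
import numpy as np

# Hypothetical noisy linear data standing in for the toy dataset above.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Closed-form least squares: solve for [a, b] minimising the mean square error.
X = np.column_stack([x, np.ones_like(x)])   # design matrix with intercept column
(a_star, b_star), *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = a_star * x + b_star
mse = np.mean((y - y_hat) ** 2)
```

With enough data the fitted `a_star`, `b_star` land close to the true values, and the remaining mean square error is roughly the variance of the noise.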
<h2 id="2-modelling-rates-of-change">2. Modelling Rates of Change</h2>
<p><br /></p>
<p>In order to optimise the loss function we typically require the function which we’re approximating to be differentiable. In research and in practice, great care is taken to ensure that neural network architectures are indeed differentiable. Now in <em>Calculus I</em>, we learn that every differentiable function \(f\) has a <strong>derivative</strong> \(f’\), which is the continuous limit of the <em>rate of change</em> of \(f\):</p>
<p>\[
\frac{df}{dx} = f’.
\]</p>
<p>We also learn in calculus that if a function satisfies certain continuity properties then we can go the other way round and <strong>integrate</strong> it, so that if</p>
<p>\[
F(x) = \int f(x)dx,
\]</p>
<p>then</p>
<p>\[
F’(x) = f(x).
\]</p>
<p>How does this relate to our modelling problem? With regression, we assumed that there was a <em>continuous</em> and <em>differentiable</em> relationship between \(x\) and \(y\), described by the function \(f\):</p>
<p>\[
y = f(x).
\]</p>
<p>In regression we try to find \(f\) <em>directly</em>. But if \(f\) is differentiable, what if we tried to find its <em>derivative</em> instead? This amounts to searching for \(f\) <em>indirectly</em> by differentiating the regression relationship:</p>
<p>\[
\frac{dy}{dx} = f’(x).
\]</p>
<p>This is a basic form of an <strong>ordinary differential equation</strong>, or an <strong>ODE</strong>. Solving the ODE is equivalent to solving the integral</p>
<p>\[
f(x) = \int f’(x)dx,
\]</p>
<p>and can therefore be viewed as <em>function approximation</em>, only here we are <em>approximating the derivative</em> instead.</p>
<h3 id="odes-basic-form">ODE’s: Basic Form</h3>
<p><br /></p>
<p>In the basic form of an ODE we allow the derivative to depend not only on \(x\), but also on \(y\). This allows for greater modelling flexibility. Now in calculus you learnt that the integral depends on at least one unknown constant. To solve for this constant we need a known point, \((x_0, y_0)\). If there is more than one constant, we need more points. Luckily in the machine learning scenario we have an entire dataset, \(\mathscr{D}\), of \(N\) data points!</p>
<blockquote>
<p>The functional form of the integral will depend on the approximating function we choose for the derivative. The precise form of the integral will depend on the data.</p>
</blockquote>
<p>To keep the notation limited, we’ll use \(f\) to denote the function describing the derivative in the ODE. The setup is then,</p>
<p>\[
y’(x) = f(x, y), \ \ \ \ y(x_0) = y_0,
\]</p>
<p>where \(y(x)\) is interpreted as “the value of \(y\) at \(x\)”. From the most abstract point of view, nothing much has changed. We are still interested in finding some function - called \(f\). What has changed fundamentally, however, is that now this function describes the <strong>rate of change</strong> - how \(y\) changes as \(x\) changes - as opposed to the direct relationship.</p>
<p>Why is this useful? We’ll see that approximating derivatives reduces the number of parameters, and also the number of function evaluations (computational cost) required to find the optimal parameters.</p>
<h3 id="parametric-efficiency">Parametric Efficiency</h3>
<p><br /></p>
<p>Let’s go back to the dataset I showed earlier in \(\mathbb{R}^2\). In the regression scenario, we made the assumption that the function we are trying to approximate is linear, i.e.</p>
<p>\[
y \approx f(x) = ax + b.
\]</p>
<p>In this case there are <em>two free parameters</em>, \(a\) and \(b\).</p>
<blockquote>
<p>The term <strong>free</strong> means that these parameters are allowed to <em>vary</em> as we try to minimise the loss. They are the parameters which we are interested in optimising.</p>
</blockquote>
<p>What if we used the ODE approach instead? We know the <em>parametric</em> form of the derivative, which we can write as:</p>
<p>\[
\frac{d\widehat y}{dx} = a.
\]</p>
<p>Now we are trying to approximate the function:</p>
<p>\[
\frac{dy}{dx} \approx f(x, y) = a.
\]</p>
<p>Immediately we can see that there is only <em>one free parameter</em>, \(a\).</p>
<blockquote>
<p>If we were to solve this problem analytically, we would still end up needing to solve for the parameter \(b\). However most interesting ODE problems can’t be solved analytically and require <strong>numerical methods</strong>.</p>
</blockquote>
<p>These numerical methods don’t give you the integral analytically; instead they give you a set of function evaluations at future points. Thus we don’t need to care about the extra parameter \(b\) at all. All we need is an initial point \((x_0, y_0)\) to get us started and any number of additional points from our data to tune the fit.</p>
<p>It turns out that optimising an ODE can be more computationally efficient than regression. In order to explain this, it will be helpful to look a bit more into numerical integration.</p>
<h3 id="numerical-methods-for-odes">Numerical Methods for ODE’s</h3>
<p><br /></p>
<p>When we can’t find the solution to an ODE analytically, and this is often the case in practical situations, we need to resort to numerical methods which approximate a solution at discrete evaluation points.</p>
<p>How do we go about finding the solution to an ODE numerically? For starters, the gradient of the function we are trying to solve for is available to us - in our setup it is \(f(x,y)\) - and we know that the gradient traces the curvature of that function. The picture below describes this for the parabola:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/gradients_tracing_parabola.png" /></p>
<blockquote>
<p><strong>Gradients</strong> tracing the parabola.</p>
</blockquote>
<p>So we can use precisely this logic to compute integrals numerically. We can start at an initial point and move in the direction of the gradient evaluated at the initial point to get to a new <em>evaluation point</em>. Starting at this second evaluation point we can repeat the same procedure to move on to a <em>third</em> evaluation point, and so on.</p>
<p>This is the basic idea behind Euler’s method. While it seems simple, Euler’s method is a starting point for advanced numerical methods, which build on this basic idea with more sophisticated updates at each step. Typically a single step will be composed of smaller steps, sometimes using higher order gradients where available. The more substeps that are taken, the higher the fidelity of the method.</p>
<blockquote>
<p>Euler’s method falls into two larger classes of methods: <a href="https://en.wikipedia.org/wiki/Runge–Kutta_methods">Runge-Kutta methods</a> and <a href="https://en.wikipedia.org/wiki/Linear_multistep_method#Adams–Bashforth_methods">Adams-Bashforth</a> methods. In both cases, Euler’s method is the lowest-order method.</p>
</blockquote>
<h3 id="eulers-method">Euler’s Method</h3>
<p><br /></p>
<p>To describe Euler’s method, we’ll replace the symbol for the input domain by \(t\). By convention we’ll interpret \(t\) as being a time element in the evolution of \(y\) as we iteratively use the method. This will also tie in better with the paper. Euler’s method is derived from the basic definition of the tangential approximation of the gradient at a point:</p>
<p>\[
\frac{dy}{dt} \approx \frac{y(t + \delta) - y(t)}{\delta},
\]</p>
<p>where \(\delta\) is a fixed <strong>step-size</strong>. We can rearrange this expression to get:</p>
<p>\[
y(t + \delta) = y(t) + \delta\frac{dy}{dt}.
\]</p>
<p>This is an explicit algebraic description of the stepwise procedure which I described and illustrated above. We can then plug in the functional formula for the derivative to get</p>
<p>\[
y(t + \delta) = y(t) + \delta f(t, y).
\]</p>
<p>Finally to compute approximations for \(y\) using Euler’s method, we need to discretize the domains. Starting from an initial point \((t_0, y_0)\), we define a <strong>computation trajectory</strong> recursively as follows:</p>
<p>\[
t_{n+1} = t_{n} + \delta, \ \ \ \ n = 0, 1, 2, …
\]
\[
y_{n+1} = y_{n} + \delta f(t_{n}, y_{n}), \ \ \ \ n = 0, 1, 2, …
\]</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/eulers_method_steps.png" /></p>
<blockquote>
<p><strong>Euler’s method illustrated:</strong> at each time point, we use the current value of \(y\) to calculate the next value. The gradient tells us which direction to move in and by how much.</p>
</blockquote>
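The recursion above can be sketched in a few lines of Python; the test problem \(dy/dt = y\) with \(y(0) = 1\) (exact solution \(e^t\)) is my own choice for checking the implementation, not from the post:

```python
import numpy as np

def euler(f, t0, y0, delta, n_steps):
    """Trace a computation trajectory with the recursion
    t_{n+1} = t_n + delta, y_{n+1} = y_n + delta * f(t_n, y_n)."""
    ts, ys = [t0], [y0]
    for _ in range(n_steps):
        ys.append(ys[-1] + delta * f(ts[-1], ys[-1]))
        ts.append(ts[-1] + delta)
    return np.array(ts), np.array(ys)

# Check on dy/dt = y with y(0) = 1: after 1000 steps of size 0.001
# we should land near the exact value y(1) = e, slightly undershooting.
ts, ys = euler(lambda t, y: y, t0=0.0, y0=1.0, delta=0.001, n_steps=1000)
```

Shrinking `delta` (with correspondingly more steps) drives the approximation closer to the true solution, at the cost of more function evaluations.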
<h3 id="optimising-the-ode">Optimising the ODE</h3>
<p><br /></p>
<p>At this point we have a modelling assumption that</p>
<p>\[
\frac{dy}{dx} = f(x, y) = a,
\]</p>
<p>and we have a numerical method for evaluating \(y\) at different values of \(x\). However to use Euler’s method we need to explicitly define \(f(x, y)\), which requires that we have a value for \(a\).</p>
<blockquote>
<p>The machine learning paradigm is to treat <strong>a</strong> as a free parameter which we can optimise.</p>
</blockquote>
<p>In particular we can make an initial guess for \(a\) and use Euler’s method to compute the <strong>forward pass</strong> on our data. Each numerical method is different, but for Euler’s method we will compute the forward pass as follows:</p>
<ol>
<li>Choose an initial value for \(a\) and a <em>fixed</em> stepsize \(\delta\)</li>
<li>Choose some \((x_0, y_0)\) to be the initial value. Choose \(k\) data points to evaluate at, and sort them in increasing order. Choosing \(k < N\) points reduces the cost of this sorting step and of the forward pass.</li>
<li>Add additional points from \(\mathscr{X}\) so that you can use Euler’s method at regular intervals. This is your computation trajectory.</li>
<li>Compute Euler’s method along your computation trajectory, also evaluating at your chosen points from your data.</li>
</ol>
<p>You can then calculate the loss by comparing Euler’s method evaluated at the \(k\) chosen datapoints with their actual values. In the backward pass you can calculate a derivative of your loss function with respect to \(a\), for each evaluation point, and adjust it as you would in gradient descent.</p>
<p>In summary, by using Euler’s method in the context of a machine learning <em>optimisation</em> problem, we treat \(a\) as a free parameter. Importantly \(a\) is the only free parameter and can be used to describe a linear relationship which typically requires two parameters.</p>
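Putting the forward pass and the parameter update together, here is a minimal sketch of this optimisation loop on hypothetical data generated from \(y = 2x + 1\). The closed-form sensitivity \(d\widehat{y}/da = x - x_0\), which holds for this particular constant-derivative model, stands in for a general backward pass:

```python
import numpy as np

# Hypothetical linear data y = 2x + 1; we fit the single free parameter a
# in dy/dx = a, anchored at the initial value (x0, y0) = (0, 1).
xs = np.linspace(0.0, 5.0, 20)
ys = 2.0 * xs + 1.0

def euler_forward(a, x0, y0, x_targets, delta=0.01):
    """Forward pass: step Euler's method from (x0, y0) and record the
    approximate y at each target x (targets assumed sorted, >= x0)."""
    x, y, out = x0, y0, []
    for xt in x_targets:
        while x < xt - 1e-12:
            step = min(delta, xt - x)
            y += step * a          # f(x, y) = a for this model
            x += step
        out.append(y)
    return np.array(out)

a = 0.0                            # initial guess for the free parameter
for _ in range(200):               # gradient descent on the mean square error
    y_hat = euler_forward(a, 0.0, 1.0, xs)
    grad = np.mean(2.0 * (y_hat - ys) * (xs - 0.0))  # d(y_hat)/da = x - x0 here
    a -= 0.05 * grad
```

After a couple of hundred updates `a` settles at the true slope 2, even though the intercept never appears as a parameter: it is pinned down by the initial value.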
<h3 id="computational-cost">Computational Cost</h3>
<p><br /></p>
<p>This process is also more computationally efficient. In the simple linear regression example we can solve for the free parameters analytically using well known formulas. However in more complex regression (and classification) problems we use gradient descent to solve for the parameters. Each step of gradient descent requires computing the loss at every datapoint. Even if we use stochastic gradient descent, we still require several epochs, passing through the whole dataset, in order to find the optimal parameters. Throughout training we ultimately rely not only on the entire dataset itself, but also on performing function evaluations on the entire dataset during the forward and backward pass.</p>
<p>With an ODE, we require only one initial datapoint to get started. We can then perform \(N\) function evaluations at each point in the forward pass. We then need to evaluate the loss function at only a handful of data points before deciding how to update the free parameters. We can run this for several epochs until we’re satisfied. Ultimately, solving an ODE requires fewer function evaluations, particularly in the backward pass. In fact, one of the contributions made in the paper is to show that this tends to be empirically true for neural ODE’s as well.</p>
<blockquote>
<p>If we use modern implementations in numerical ODE methods, then we’re in even better luck. These solvers are designed to adapt the amount of computation required so that the number of function evaluations can be decreased if it doesn’t impact error.</p>
</blockquote>
<p>To get an intuition for why we do not need the whole dataset, consider a situation where we are trying to fit a <em>parabola</em> which requires two parameters to describe the derivative:</p>
<p>\[
\frac{dy}{dx} = ax + b.
\]</p>
<p>After choosing an initial guess for \(a\) and \(b\) and an initial data point, we can run several iterations of Euler’s method and compare it to the true data:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/eulers_method_diverging.png" /></p>
<blockquote>
<p><strong>Euler’s method diverging:</strong> simple methods like Euler’s method can easily diverge. More advanced methods will be more robust.</p>
</blockquote>
<p>We only need to compare the most recent function evaluations with the data to determine whether the method is diverging. We can then use gradient descent to adjust the free parameters.</p>
<h1 id="3-neural-odes">3. Neural ODE’s</h1>
<p><br /></p>
<p>So far I’ve described how we can use an ODE to solve a data modelling problem. We approximate the relationship between points in our input and output domain by optimising the functional form of the rate of change:</p>
<p>\[
\frac{dy}{dx} = f(x, y).
\]</p>
<p>With <em>deep learning</em> we can go even further. We know that neural networks are <a href="http://neuralnetworksanddeeplearning.com/chap4.html"><em>universal function approximators</em></a>. So what if we used our neural network to approximate \(f\)? In theory if we can approximate the derivative of any differentiable function using a neural network, then we have a powerful modelling tool at hand. We also have a lot more flexibility in the data modelling process.</p>
<p>In particular, consider a neural network where the <em>hidden states</em> all have the same dimension (this will also be helpful for visualising their evolutionary trajectories later). Each hidden state depends on a <em>neural layer</em>, \(f\), which itself depends on <em>(free) parameters</em> \(\theta_t\), where \(t\) is the <em>layer depth</em>. Then</p>
<p>\[
h_{t+1} = f(h_{t}, \theta_{t}).
\]</p>
<p>If we have a <em>residual network</em>, then this looks like</p>
<p>\[
h_{t+1} = h_{t} + f(h_{t}, \theta_{t}).
\]</p>
<p>One of the intuitions which inspired the paper is that this has a similar form to Euler’s method which we described above. To understand this, remember that Euler’s method is a discretisation of the <em>continuous relationship</em> between the input and output domains of the data. Neural networks are also discretisations of this continuous relationship, only the discretisation is through hidden states in a latent space. Residual neural networks create a pathway through this latent space by allowing states to depend directly on each other, just like the updates in Euler’s method.</p>
<blockquote>
<p>A <em>residual neural network</em> appears to follow the modelling pattern of an ODE: namely, the continuous relationship is modelled at the level of the derivative.</p>
</blockquote>
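The correspondence can be made concrete in a couple of lines; the toy layer below is an invented stand-in for a real network:

```python
import numpy as np

def layer(h, theta):
    # A toy "neural layer": a tanh unit parametrised by theta = (W, b).
    W, b = theta
    return np.tanh(W @ h + b)

h = np.array([0.5, -0.5])
theta = (np.eye(2) * 0.1, np.zeros(2))

# Residual update: h_{t+1} = h_t + f(h_t, theta_t) ...
h_res = h + layer(h, theta)
# ... is exactly an Euler step h_{t+1} = h_t + delta * f(h_t) with delta = 1.
h_euler = h + 1.0 * layer(h, theta)
```

The two updates are identical by construction: a residual block is an Euler step with a unit step size, which is what motivates taking the continuous limit.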
<p>To take this logic full circle, we consider the <strong>continuous limit</strong> of each discrete layer in the network. <em>This is the radical idea proposed by neural ODE’s</em>. Instead of a discrete number of layers between the input and output domains, we allow the progression of the hidden states to become continuous:</p>
<p>\[
\frac{dh(t)}{dt} = f(t, h(t), \theta_t),
\]</p>
<p>where \(h(t)\) is the value of the hidden state evaluated for some \(t\), which we understand as a <em>continuous parametrisation of layer depth</em>. The arena is opened up for solving the data problem as a Neural ODE. The next step is to explore this dynamic even further.</p>
<h1 id="4-continuous-hidden-state-dynamics">4. Continuous Hidden State Dynamics</h1>
<p><br /></p>
<p>In Euler’s method we defined a computation trajectory by starting from some initial point \((t_0, y_0)\) and recursively evaluating the ODE at fixed step sizes:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/euler_evals_through_time.png" /></p>
<blockquote>
<p><strong>Euler’s method</strong> traces an approximate evolution of \(y\) through time.</p>
</blockquote>
<p>The fixed stepsizes represent a <em>time scale</em> in the evolution of the ODE as we solve it numerically. This defines a <em>dynamic</em> of the outcome \(y\) with respect to <em>time</em>. In fact the functional form of the ODE neatly represents this idea:</p>
<p>\[
\frac{dy}{dt} = f(t, y).
\]</p>
<p>Now the key feature of a neural network is that we add in <em>hidden states</em> between the input and outcome:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/hidden_state_evals.png" /></p>
<blockquote>
<p><strong>Hidden state evaluations in a neural network:</strong> we can view the transformation made by each layer as an evolution of the hidden state through time.</p>
</blockquote>
<p>This looks similar to the computation trajectory of Euler’s method. Only now the fixed stepsizes correspond to the layers in the neural network, which defines a <em>dynamic</em> of the hidden state with respect to <em>depth</em>. This dynamic can be visualised as a <em>discrete evolution</em> of the hidden state, evaluated at each layer:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/discrete_hidden_state.png" /></p>
<blockquote>
<p><strong>Discrete hidden state trajectory:</strong> we can plot how the hidden state evolves through each layer in a neural network.</p>
</blockquote>
<p>When we take the continuous limit of the hidden state with respect to depth, we smooth out this computation trajectory so that in theory, the hidden state can be evaluated at any “depth”, which we now consider to be continuous:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/cont_hidden_state.png" /></p>
<blockquote>
<p><strong>Hidden state dynamics:</strong> the continuous hidden state trajectories will vary with the free parameters.</p>
</blockquote>
<p>In a neural ODE, we parametrise this <strong>hidden state dynamic</strong> by</p>
<p>\[
\frac{dh(t)}{dt} = f(t, h(t), \theta_t),
\]</p>
<p>where \(f(t, h(t), \theta_t)\) is a neural network layer parametrised by \(\theta_t\) at layer \(t\).</p>
<p>We can evaluate the value of the hidden state at any depth by solving the integral</p>
<p>\[
h(t) = \int f(t, h(t), \theta_t)dt.
\]</p>
<p>How do we solve this integral? Using a numerical ODE method of course. The final modelling assumption is to take the input as \(h(t_0) = x\) - i.e. the <em>initial value</em> of the ODE - and to let the output be evaluated at some time \(t_1\), so that \(h(t_1) = y\). The <em>function approximation</em> problem now takes place over a continuous hidden state dynamic:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/neural_ode_func_approx.png" /></p>
<blockquote>
<p><strong>Function approximation over hidden state dynamics:</strong> ultimately we are finding an implicit mapping \(F\) going from \(\mathscr{X}\) to \(\mathscr{Y}\).</p>
</blockquote>
<p>How do we decide what the values for \(t_0\) and \(t_1\) are? Well, since this is a machine learning approach we can treat them as two additional <strong>free parameters</strong> to be optimised. We can put everything together into a numerical ODE solver of our choice, which we can just call \(ODESolve\) in the spirit of the paper, so that:</p>
<p>\[
\widehat{y} = h(t_1) = ODESolve\big(h(t_0), t_0, t_1, \theta, f\big), <br />
\]</p>
<p>where \(f\) is a neural network.</p>
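As a sketch, here is a fixed-step Runge-Kutta 4 integrator playing the role of \(ODESolve\); the function names and the linear sanity check are my own stand-ins, not from the paper (whose solvers are adaptive):

```python
import numpy as np

def odesolve(f, h0, t0, t1, theta, n_steps=100):
    """A fixed-step Runge-Kutta 4 integrator standing in for the paper's
    ODESolve. Real adaptive solvers choose their own step sizes."""
    h = np.array(h0, dtype=float)
    t, dt = t0, (t1 - t0) / n_steps
    for _ in range(n_steps):
        k1 = f(t, h, theta)
        k2 = f(t + dt / 2, h + dt * k1 / 2, theta)
        k3 = f(t + dt / 2, h + dt * k2 / 2, theta)
        k4 = f(t + dt, h + dt * k3, theta)
        h = h + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
    return h

# Sanity check on linear dynamics f(t, h, theta) = theta * h, whose exact
# solution is h0 * exp(theta * (t1 - t0)).
y_hat = odesolve(lambda t, h, th: th * h, h0=[1.0], t0=0.0, t1=1.0, theta=1.0)
```

In the neural ODE setting the lambda would be replaced by a neural network layer, and `theta` by its weights; the solver call itself is unchanged.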
<h1 id="5-backpropogating-through-depth">5. Backpropagating Through Depth</h1>
<p><br /></p>
<p>The central innovation of the paper is an algorithm for <strong>backpropagating</strong> (reverse-mode differentiation) through the continuous hidden state dynamics. In the neural ODE, our parameters are not just \(\theta_t\), but also the evaluation times \(t_0\) and \(t_1\). As usual we can define any (differentiable) loss function of our choice on these (free) parameters:</p>
<p>\[
\mathscr{L}(t_0, t_1, \theta_t) = \mathscr{L}\Big(ODESolve\big(h(t_0), t_0, t_1, \theta, f\big)\Big).
\]</p>
<p>To optimise the loss, we require gradients with respect to the free parameters. As with the usual backpropagation algorithm for deep learning, the first step is to compute the gradient of the loss with respect to the hidden states:</p>
<p>\[
\frac{\partial\mathscr{L}}{\partial h(t)}. <br />
\]</p>
<p>But the hidden state itself is dependent on time (depth), so we can take a derivative with respect to time. Bear in mind that this time derivative will be a reverse traversal of the hidden states, which we will need to keep track of. This is where the <strong>adjoint method</strong> comes in: a decades-old numerical technique for efficiently computing gradients.</p>
<blockquote>
<p>A proof for the adjoint method and how it is used in the paper can be found in Appendix B of the paper. For an accessible explanation of the adjoint method see this <a href="https://rkevingibson.github.io/blog/neural-networks-as-ordinary-differential-equations">blogpost</a> by Kevin Gibson.</p>
</blockquote>
<p>To keep track of the time dynamics, we’ll define the so-called <strong>adjoint state</strong></p>
<p>\[
a(t) = -\frac{\partial\mathscr{L}}{\partial h(t)}. <br />
\]</p>
<p>The adjoint state represents how the loss depends on the hidden state (remember that this is continuous along a trajectory) at any time \(t\). Its time derivative is given by the following formula:</p>
<p>\[
\frac{da(t)}{dt} = -a(t)^{T}\frac{\partial f(t, h(t), \theta_t)}{\partial h(t)}.
\]</p>
<p>This derivative is computable since the loss, \(\mathscr{L}\), and \(f\), which is precisely a neural network, are both differentiable by design. This derivative also happens to be an ODE, so we can write down the solution of the adjoint state as the integral:</p>
<p>\[
a(t) = \int -a(t)^{T}\frac{\partial f(t, h(t), \theta_t)}{\partial h(t)} dt,
\]</p>
<p>or equivalently</p>
<p>\[
\frac{\partial\mathscr{L}}{\partial h(t)} = \int a(t)^{T}\frac{\partial f(t, h(t), \theta_t)}{\partial h(t)} dt.
\]</p>
<p>We can solve this integral by making a different call to an ODE solver. To get the gradient at \(t_0\), we can run this ODE solver <strong>backwards</strong> in time from the initial point which is known to us (just like usual backpropagation) at time \(t_1\):</p>
<p>\[
a(t_0) = \int_{t_1}^{t_0} -a(t)^{T}\frac{\partial f(t, h(t), \theta_t)}{\partial h(t)} dt.
\]</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2019-01-18-understanding-neural-odes/backprop_adjoint.png" /></p>
<blockquote>
<p><strong>Backpropagation using the adjoint method:</strong> the adjoint state is a vector with its own time dependent dynamics, only the trajectory runs backwards in time. We can solve for the gradients at \(t_0\) using an ODE solver for the adjoint time derivative, starting at \(t_1\).</p>
</blockquote>
<p>Now at this stage we have a way of computing the gradient with respect to \(t_1\) and \(t_0\). This covers two of the parameters. To compute gradients with respect to \(\theta\) we solve the ODE (also proven in Appendix B):</p>
<p>\[
\frac{\partial\mathscr{L}}{\partial\theta} = \int_{t_1}^{t_0} a(t)^{T}\frac{\partial f(t, h(t), \theta_t)}{\partial\theta} dt.
\]</p>
<p>Again this integral can be solved using an ODE solver. However the paper shows that we can do even better: all three integrals can be computed using <em>only one call to an ODE solver</em> by <strong>vectorising the problem</strong>. This is described in Appendix B.</p>
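To build confidence in the adjoint formulas, we can check them by hand on the simplest possible model, \(dh/dt = \theta\), where the forward solution and the loss gradient are available in closed form. The numbers below are invented for illustration:

```python
import numpy as np

# Toy problem: dh/dt = theta (constant dynamics), so
# h(t1) = h0 + theta * (t1 - t0), with a squared-error loss at t1.
h0, t0, t1, theta, y_target = 1.0, 0.0, 2.0, 0.7, 3.0

def loss(theta):
    h1 = h0 + theta * (t1 - t0)
    return (h1 - y_target) ** 2

# Adjoint method by hand, in this post's sign convention a = -dL/dh:
# da/dt = -a * df/dh = 0 (f does not depend on h), so a(t) is constant at
# a(t1) = -2 * (h(t1) - y_target), and since df/dtheta = 1,
# dL/dtheta = integral from t1 to t0 of a dt = a * (t0 - t1).
h1 = h0 + theta * (t1 - t0)
a = -2.0 * (h1 - y_target)
grad_adjoint = a * (t0 - t1)

# Compare against a central finite difference of the loss.
eps = 1e-6
grad_fd = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
```

The two gradients agree, which is a useful sanity check on the sign convention: with \(a = -\partial\mathscr{L}/\partial h\), the backward integral from \(t_1\) to \(t_0\) needs no extra minus sign.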
<h3 id="optional-augmented-state-dynamics">(Optional) Augmented State Dynamics</h3>
<p><br /></p>
<p><em>The rest of this section is based on Appendix B of the paper. It is more mathematically involved but worth going through to understand the computational implementation of backpropagation for neural ODE’s. Feel free to skip ahead to the final section and come back to this point at a later stage.</em></p>
<ol>
<li>Let \(\theta\) be “independent of time” such that
\[
\frac{d\theta}{dt} = 0,
\]
and observe that
\[
\frac{dt}{dt} = 1.
\]</li>
<li>Let \([h, \theta, t]\) represent an <em>augmented state</em>, then we can define the <em>augmented state function</em>:
<br /><br /><script type="math/tex">f_{aug}([h, \theta, t]) =
\begin{bmatrix}
f\big(t, h(t), \theta\big) \\
0 \\
1 \\
\end{bmatrix}.</script></li>
<li>Then let the <em>augmented state dynamics</em> be given by:
<br /><br /><script type="math/tex">\frac{d}{dt} \bigg[ \begin{smallmatrix}
h \\
\theta \\
t \\
\end{smallmatrix} \bigg] = f_{aug}([h, \theta, t]).</script></li>
<li>This is an ODE which has <em>augmented adjoint state</em>:
<br /><br /><script type="math/tex">a_{aug} =
\begin{bmatrix}
a \\
a_\theta \\
a_t \\
\end{bmatrix},</script></li>
</ol>
<p>where \(a\) is the adjoint for the hidden state described above:</p>
<p>\[
a = -\frac{\partial\mathscr{L}}{\partial h},
\]</p>
<p>and
\[
a_\theta = \frac{\partial\mathscr{L}}{\partial\theta}, \ a_t = \frac{\partial\mathscr{L}}{\partial t}.
\]</p>
<p>The time derivatives of this adjoint state can then be computed using the following formula (if you know anything about vector calculus, then the details can be found in Appendix B of the paper):
\[
\frac{da_{aug}}{dt} = - \biggl[ a\frac{\partial f}{\partial h}, \ a\frac{\partial f}{\partial\theta}, \ a\frac{\partial f}{\partial t} \biggr].
\]</p>
<p>The entire backpropagation algorithm can now be solved by making a call to an ODE solver on the <em>augmented state dynamics</em>.</p>
<h1 id="tying-everything-together">Tying Everything Together</h1>
<p><br /></p>
<p>Thank you for making it to this point. I hope that you now have a better understanding of how neural ODE’s can help solve your data modelling problem.</p>
<p>My intention is to explain intuitively how ODE’s can model a simple modelling problem, and how they can be optimised (in a simple scenario). The second half of this post extended that simple model to the neural ODE model as it’s presented in the paper. We covered how an ODE problem can be parametrised by a neural network and how the neural network parameters can be optimised by backpropagating through the ODE using the adjoint method.</p>
<p>For now, it is worth reiterating the neural ODE approach to solving a data modelling problem.</p>
<ol>
<li>We have a set of \(N\) pairs of data points, \(\big\{(x_1, y_1), …, (x_N, y_N)\big\}\). Given a new data point, \(x^{*}\), we would like to make a prediction about its value \(y^{*}\). We seek a functional approximation to the relationship between the input and output domains.</li>
<li>Instead of modelling this relationship directly, we model the derivative:
\[
\frac{dy}{dx} = f(x, y).
\]</li>
<li>We parametrise this approximation by a neural network with hidden states \(h(t)\), <em>depending continuously</em> on layer depth \(t\), with \(h(t_0) = x\) and \(h(t_1) = y\). The function approximation problem is now
\[
\frac{dh(t)}{dt} = f(t, h(t), \theta),
\]
where \(f\) is a neural network.</li>
<li>This is an ODE describing <em>continuous hidden state dynamics</em>. We can solve the data modelling problem by solving this ODE. This reduces to solving the integral
\[
h(t) = \int f(t, h(t), \theta) dt.
\]
To make predictions we solve the definite integral
\[
\widehat{y} = h(t_1) = \int_{t_0}^{t_1} f(t, h(t), \theta) dt.
\]
The analytical solution of this integral is not available to us. Instead we can use a numerical method (available to us using modern numerical computation software) to solve the integral at the required evaluation points:
\[
\widehat{y} = h(t_1) = ODESolve(h(t_0), t_0, t_1, \theta, f).
\]</li>
<li>The free parameters of this problem are \(t_0\), \(t_1\) and \(\theta\). We optimise our choice of these free parameters by backpropagating through the ODE solver using the method of adjoints.</li>
</ol>
<p><em>Photo by Randall Ruiz on Unsplash</em></p>
<p><em>Jonty Sinai: In this blogpost I explore how ODE’s can be used to solve data modelling problems. I take a deep dive into the data modelling problem at hand and present ODE’s (which model rates of change) as an alternative to regression (which attempts to model data directly). Later I introduce the extension to neural ODE’s. To keep the focus on neural ODE’s I’ll assume that you have knowledge of linear regression, deep learning and basic calculus.</em></p>
<h1 id="probability-part-2-conditional-probability">Probability Part 2: Conditional Probability</h1>
<p><em>Jonty Sinai, 2018-12-23, https://jontysinai.github.io/jekyll/update/2018/12/23/probability-part-two-conditional-probability</em></p>
<p><em>This is the second in a series of blogposts which I am writing about probability. In this post I introduce the fundamental concept of conditional probability, which allows us to include additional information into our probability calculations. The ideas behind conditional probability lead naturally to the most important idea in probability theory, known as Bayes’ Theorem. As with the first post, which you can read <a href="https://jontysinai.github.io/jekyll/update/2017/11/23/probability-for-everyone.html">here</a>, the approach is to explain conditional probability using mathematical ideas from measure theory. Again, as with the first post, this post is meant to be accessible to anyone, regardless of whether you’ve studied maths or not.</em></p>
<h1 id="a-game-of-dice">A Game of Dice</h1>
<p><br /></p>
<p>In my last post I discussed probability as a way of measuring the uncertainty of an event. We thought of probability as assigning a weight to the chance that an event has a particular outcome, which we measured relative to other possible outcomes. We defined an event space, which we called \(\Omega\). On this space of events we defined a <em>random variable</em>, \(X\), as a function which maps each event to our observed outcomes, encoded as numbers on the real line, \(\mathbb{R}\):</p>
<p>\[
X:\Omega \to \mathbb{R}.
\]</p>
<p>The probability of an outcome is then the size of the set of possible events which are <em>mapped</em> to that outcome, <em>relative to the size of the whole event space</em>:</p>
<p>\[
\mathbb{P}:\mathscr{F} \to [0,1].
\]</p>
<p>We called the set of possible events the <a href="https://jontysinai.github.io/jekyll/update/2017/11/23/probability-for-everyone.html#the-preimage"><em>preimage</em></a> of the random variable. If \(B\) is an outcome, then the preimage of \(B\) is \(X^{-1}(B)\) and the probability of \(B\) is the relative size of the preimage, given by \(\mathbb{P}\big(X^{-1}(B)\big)\).</p>
<p>For instance, in the dice example from the last post where we considered the experiment of rolling two dice and adding the numbers, if we observe an <em>outcome</em> \(9\), then there were four possible <em>events</em> which could add up to \(9\), namely \((3,6)\), \((4,5)\), \((5,4)\) and \((6,3)\). We called this set \(A\). Since there are <a href="https://qph.fs.quoracdn.net/main-qimg-13d2e066e80c0ac1511e0477c6ffdcb4-c">\(36\)</a> possible outcomes in \(\Omega\), we can then measure the probability of getting a \(9\) by the relative size of \(A\) with respect to \(\Omega\). If \(X\) is the random variable representing our experiment, then \(A\) is the preimage of the observation \(X = 9\), and the probability of \(A\) is</p>
<p>\[
\mathbb{P}(A) = \frac{4}{36} = \frac{1}{9}.
\]</p>
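<p>We can verify this counting argument directly in code. The following is just an illustrative sketch of the enumeration above:</p>

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))    # all 36 ordered rolls of two dice
A = [pair for pair in omega if sum(pair) == 9]  # the preimage of the outcome 9

p_A = Fraction(len(A), len(omega))
print(A)    # [(3, 6), (4, 5), (5, 4), (6, 3)]
print(p_A)  # 1/9
```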
<h1 id="thinking-about-the-past">Thinking About the Past</h1>
<p><br /></p>
<p>However, what if we were able to incorporate information about events that have happened before? For example, suppose that instead of rolling the two dice at once, we roll them one at a time. In this case, depending on the outcome of the first roll, the probability of getting a \(9\) will be different. To see how, let’s consider an example where our first roll is a \(3\). In the first scenario there was a total of 36 possible configurations <em>before</em> any roll was made. However, now that we’ve <em>observed</em> one of the dice as a \(3\), the total number of configurations is diminished, namely the set:</p>
<p>\[
B = \big\{(3,1), (3,2), (3,3), (3,4), (3,5), (3,6) \big\}.
\]</p>
<p>In this case there is only one possible event leading to an outcome of \(9\): the event that we get a \(6\) on the second roll. How do we calculate the probability of \(9\) now? By the end of this blogpost we’ll see that calculating the <em>total uncertainty</em> will require a fundamental and important result in probability theory, known as <em>Bayes’ Theorem</em>. We’ll work our way towards that point, but for now we’ll start with the simplest scenario.</p>
<h2 id="using-what-we-already-know">Using What We Already Know</h2>
<p><br /></p>
<p>We’ve reduced the problem to the event that we have <em>already rolled</em> a \(3\), so <em>all of our uncertainty now lies with the second roll</em>. In this case we’ve seen that there are only \(6\) possible events in total, with only <em>one</em> of them leading to a \(9\). Using the same logic from the previous post, the <em>probability of rolling a \(9\), <strong>given</strong> that we have already rolled a \(3\)</em> is:</p>
<p>\[
\mathbb{P}(\text{observe a }9 | \text{rolled a }3) = \frac{1}{6}.
\]</p>
<p>The first thing to notice is that this probability is higher than the totally uncertain scenario, where \(\mathbb{P}(A) = 1/9\). This makes sense since rolling a \(3\) greatly increases our chances of getting a \(9\), even though there is only one valid configuration out of \(6\). To understand this better, suppose that we do not roll a \(\ 3\) on the first roll, then our space of possible events, which we’ll call \(\Theta\) (capital Greek letter “theta”) has \(30\) possible configurations. On this space there are \(3\) possible configurations leading to a 9, namely \((4, 5), (5, 4)\) and \((6, 3)\). So the probability of getting a \(9\) <strong>given</strong> that we <em>do not</em> roll a three on the first roll is:</p>
<p>\[
\mathbb{P}(\text{observe a }9 | \text{have not rolled a }3) = \frac{3}{30} = \frac{1}{10},
\]</p>
<p>which is less than the totally uncertain scenario, and even smaller than the scenario where we roll a \(3\) on the first throw. Clearly rolling a \(3\) on the first roll increases our chances of getting a \(9\).</p>
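<p>Both conditional probabilities can be checked by explicitly restricting the event space, mirroring the argument above (a small illustrative sketch):</p>

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))  # all 36 ordered rolls of two dice

def p_nine_given(first_roll_allowed):
    # restrict the event space to the rolls allowed by the conditioning event
    restricted = [pair for pair in omega if first_roll_allowed(pair[0])]
    favourable = [pair for pair in restricted if sum(pair) == 9]
    return Fraction(len(favourable), len(restricted))

print(p_nine_given(lambda first: first == 3))  # 1/6
print(p_nine_given(lambda first: first != 3))  # 1/10
```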
<blockquote>
<p>In fact, the principle is more general than that: any conditioning event which makes the configurations in \(A\), which is after all the preimage of \(X = 9\), relatively more likely will <em>increase</em> our chances of observing a \(9\). This fundamental idea can be explained graphically.</p>
</blockquote>
<h1 id="the-shape-of-probability">The Shape of Probability</h1>
<p><br /></p>
<p>Let’s go back to an earlier statement where I said that the probability of an outcome \(B\) is given by the relative size of its preimage:</p>
<p>\[
\mathbb{P}(B) = \frac{size\big(X^{-1}(B)\big)}{size(\Omega)}.
\]</p>
<p>Let’s unpack what this means graphically. We can visualise the events in \(\Omega\) as shaded regions - for example the set \(X^{-1}(B)\), which for simplicity we simply label as \(B\), is shown below:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2018-12-23-probability-part-two-conditional-probability/set_b.png" /></p>
<p>While these visualisations are just a heuristic and are too imprecise for measuring probabilities, they are helpful for visualising how different events interact, and hence how their probabilities may interact.</p>
<p>For example, if \(A\) is the event our rolls add up to \(9\), we can include that in our diagram and it is now clear that this event intersects \(B\) at exactly one point: the pair \((3, 6)\).</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2018-12-23-probability-part-two-conditional-probability/set_a_b.png" /></p>
<p>This was the one configuration leading to \(9\), with a \(3\) on the first roll. However, since this is a point in \(\Omega\), why don’t we measure its probability as \(\mathbb{P}\big((3,6)\big)= 1/36\)? That would depend on what we were looking for. In particular we were interested in the <em>probability of \(9\), given that we have already rolled a \(3\)</em>. If we have already rolled a \(3\), then we are no longer in the whole space \(\Omega\).</p>
<blockquote>
<p>Configurations such as \((2, 4)\), \((1, 5)\) and \((4, 1)\) are no longer available to us, so clearly we have to redefine our space of interest.</p>
</blockquote>
<h3 id="revaluating-size">Re-evaluating Size</h3>
<p><br /></p>
<p>To see how to do so, let’s go back to the diagram of \(A\) and \(B\). If we have already rolled a \(3\), the possible configurations available to us have shrunk from the whole of \(\Omega\) to just the set \(B\)! This shrinking action also excludes some of the configurations in \(A\), all except for \((3, 6)\), which is in the <strong>intersection</strong> of \(A\) and \(B\). We can think of this visually as:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2018-12-23-probability-part-two-conditional-probability/condition_set.png" /></p>
<p>Now, when we take the probability of \(9\), <em>given</em> that we have rolled a \(3\), we have to re-evaluate how we measure size in the context of what is now possible. The preimage of \(9\) is now just the intersection. Since the event \(B\) is assumed to have already happened (as visualised by the shrinking of the event space to \(B\)), we measure the size of the preimage relative to \(B\), so that</p>
<p>\[\begin{align}
\mathbb{P}(\text{observe a }9 | \text{rolled a }3) & = \mathbb{P}(A | B) \\
& \\
& = \dfrac{size(A\cap B)}{size(B)} \\
& \\
& = \dfrac{1}{6}.
\end{align}\]</p>
<h1 id="conditional-probability">Conditional Probability</h1>
<p><br /></p>
<p>What we have just seen is known as the <strong>conditional probability</strong> of an event. It is the probability of the event \(A\), <em>conditional</em> on the event \(B\). By thinking of <em>conditioning</em> as a restriction on the size of the event space, we can measure the <em>conditional probability of \(A\) given \(B\)</em> as</p>
<p>\[
\mathbb{P}(A| B) = \frac{size(A\cap B)}{size(B)}.
\]</p>
<p>We can make this even more intuitive by remembering that the probability of any event is given by the size of that event relative to the whole set, namely:</p>
<p>\[
\mathbb{P}(B) = \frac{size(B)}{size(\Omega)}.
\]</p>
<p>A slight rearrangement gives:</p>
<p>\[
size(B) = size(\Omega)\mathbb{P}(B),
\]</p>
<p>which we can do since \(size(\Omega)\) is always positive. We can get a similar formula for the set \(A\cap B\). Plugging both formulas back into our expression for the conditional probability gives</p>
<p>\[
\mathbb{P}(A| B) = \frac{size(\Omega)\mathbb{P}(A\cap B)}{size(\Omega)\mathbb{P}(B)},
\]</p>
<p>which simplifies to</p>
<p>\[
\mathbb{P}(A| B) = \frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}.
\]</p>
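<p>The formula can be sanity-checked on the dice example by measuring relative sizes directly (an illustrative sketch):</p>

```python
from itertools import product
from fractions import Fraction

omega = set(product(range(1, 7), repeat=2))     # all 36 ordered rolls of two dice
A = {pair for pair in omega if sum(pair) == 9}  # the rolls sum to 9
B = {pair for pair in omega if pair[0] == 3}    # the first roll is a 3

def prob(event):
    # P(E) = size(E) / size(Omega)
    return Fraction(len(event), len(omega))

p_A_given_B = prob(A & B) / prob(B)  # P(A|B) = P(A∩B) / P(B)
print(p_A_given_B)  # 1/6
```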
<h3 id="tying-back-to-intuition">Tying Back to Intuition</h3>
<p><br /></p>
<p>Intuitively this says that the probability of \(A\) <em>given</em> \(B\) is the probability of \(A\) <em>and</em> \(B\) divided by the probability of <em>just</em> \(B\). By deriving this formula in the example, we understand this division as the shrinking of the event space. Mathematically this shrinking action is referred to as the <em><strong>projection</strong> of the event space onto the conditioning space</em>.</p>
<blockquote>
<p>Immediately we know from elementary mathematics that the probability of \(B\) cannot be zero, to avoid a division error. Indeed this reinforces our intuition, since we cannot condition on an impossible event!</p>
</blockquote>
<p>Finally we can view the intersection as a restriction of our event of interest to the conditioning event. Indeed, the intersection of \(A\) and \(B\) can be thought of as the <em>projection of the set \(A\) onto the set \(B\)</em>, which is the conditioning space.</p>
<blockquote>
<p>All in all, we can think of <strong>conditional probability</strong> as <em>probability which is <strong>projected</strong> onto some (smaller) <strong>conditioning space</strong></em>.</p>
</blockquote>
<h1 id="bayes-theorem-the-fundamental-property-of-probability">Bayes’ Theorem: The Fundamental Property of Probability</h1>
<p><br /></p>
<p>It turns out that the intersection is symmetric: the projection of \(A\) onto \(B\) is identical to the projection of \(B\) onto \(A\). Indeed the diagrams from earlier reinforce this.</p>
<p>Based on the same procedure we could have just as easily derived the conditional probability</p>
<p>\[
\mathbb{P}(B| A) = \frac{\mathbb{P}(A\cap B)}{\mathbb{P}(A)}.
\]</p>
<p>Once again, since \(\mathbb{P}(A)\) is positive and cannot be zero, we can use a mathematical sleight of hand to derive an expression for the probability of the intersection:</p>
<p>\[
\mathbb{P}(A\cap B) = \mathbb{P}(B| A)\mathbb{P}(A).
\]</p>
<p>This says that the probability of \(A\) <em>and</em> \(B\) is the <em>conditional probability of \(B\) given \(A\)</em> times the probability of <em>just</em> \(A\).</p>
<blockquote>
<p>Intuitively if we take the conditional probability of \(B\) given that \(A\) has already happened, and we factor in the probability of \(A\), we must be left with the probability of both.</p>
</blockquote>
<p>But we know from earlier that if we have the probability of the intersection, we can project the conditioning space of \(B\) to calculate the <strong>inverse probability</strong>: the <em>conditional probability of \(A\) given \(B\)</em>, which is given by:</p>
<p>\[\begin{align}
\mathbb{P}(A | B) & = \dfrac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)} \\
& \\
& = \dfrac{\mathbb{P}(B | A)\mathbb{P}(A)}{\mathbb{P}(B)}.
\end{align}\]</p>
<p>This final result is known as <strong>Bayes’ Theorem</strong>.</p>
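<p>We can check Bayes’ Theorem numerically on the running dice example, with every quantity computed by counting (an illustrative sketch):</p>

```python
from itertools import product
from fractions import Fraction

omega = set(product(range(1, 7), repeat=2))  # all 36 ordered rolls of two dice
A = {p for p in omega if sum(p) == 9}        # the rolls sum to 9
B = {p for p in omega if p[0] == 3}          # the first roll is a 3

def prob(event):
    return Fraction(len(event), len(omega))

prior = prob(A)                            # P(A)   = 1/9
likelihood = prob(A & B) / prob(A)         # P(B|A) = 1/4
evidence = prob(B)                         # P(B)   = 1/6

posterior = likelihood * prior / evidence  # Bayes' Theorem
print(posterior)  # 1/6, matching the direct computation P(A∩B)/P(B)
```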
<p>In this formulation, if \(A\) is our event of interest and \(B\) is the conditioning event, then the quantity</p>
<p>\[
\mathbb{P}(A \vert B)
\]</p>
<p>is known as the <strong>posterior probability</strong> of \(A\) given \(B\).</p>
<p>The quantity</p>
<p>\[
\mathbb{P}(B \vert A)
\]</p>
<p>is known as the <strong>likelihood</strong> of \(A\) given \(B\).</p>
<p>Finally the quantity</p>
<p>\[
\mathbb{P}(A)
\]</p>
<p>is simply known as the <strong>prior probability</strong> of \(A\).</p>
<blockquote>
<p>Intuitively the <em>prior</em> of \(A\) is the raw probability of \(A\) before any other event is taken into consideration. The <em>likelihood</em>, although taken as a conditional probability of \(B\), can be thought of as a measure of how likely it is that \(A\) depends on \(B\). To understand why, recall that we derived the likelihood’s place in the formula by considering the symmetry of the intersection. Finally the <em>posterior</em> is the final probability of \(A\) after conditioning on \(\ B\).</p>
</blockquote>
<h3 id="the-importance-of-bayes-theorem">The Importance of Bayes’ Theorem</h3>
<p><br /></p>
<p>Fundamentally, Bayes’ Theorem gives us a way of measuring conditional probability by taking into consideration our <em>total uncertainty</em>. This is the point I promised to get to earlier in this post. In the example, instead of considering each roll one at a time and then enumerating how the total event space changes after the first roll lands on a \(3\), we can measure \(\mathbb{P}(\text{observe a }9 \vert \text{rolled a }3)\) directly by considering all of the quantities in Bayes’ Theorem.</p>
<blockquote>
<p>I’ll leave this as an exercise, but by thinking carefully about the likelihood (<em>hint</em>: recall how we derived the likelihood using the symmetry of the intersection), you can calculate the posterior and see that it yields the same result.</p>
</blockquote>
<p>Although it is not obvious from the elementary example which I used to derive conditional probability using <em>measure theoretic</em> intuition, there are situations where computing likelihoods is much easier than directly measuring the posterior of interest. This post was intended as an introduction to conditional probability, so I will not go into the details here. However, for those interested I suggest reading Chapter 3 of the late David MacKay’s book <em>Information Theory, Inference and Learning Algorithms</em>, which is available for free <a href="http://www.inference.org.uk/itila/book.html">here</a>.</p>
<blockquote>
<p>Although ITILA (as it is abbreviated to) has been superseded by more modern machine learning textbooks (which, considering the pace of research in machine learning, don’t have long either before they are out of date), it still remains a fundamental reference for learning algorithms, especially from a probabilistic point of view.</p>
</blockquote>
<h2 id="what-about-casuality">What About Causality?</h2>
<p><br /></p>
<p>There is a natural tendency to think of conditional probability as specifying a causal relationship. If we condition \(A\) on \(B\), does this mean that \(B\) causes \(A\)? In general this is not the case, and it is dangerous to fall into this trap. Indeed, in the derivation I argued that the conditioning action is somewhat symmetric, in the sense that if we can condition on \(B\), then we can also condition on \(A\). In most cases this is true. In fact Bayes’ Theorem demands that we think of conditional probabilities (the posterior) as depending on their <em>conditional inverse</em> (the likelihood).</p>
<p>Causality is one of the least well understood concepts in statistics. It requires that we rethink how statistical relationships are modelled. In the future, depending on how well I can understand the basic ideas, I hope to write a series of similar blogposts on causal inference.</p>
<h3 id="bonus">Bonus</h3>
<p><br /></p>
<p>The cover image for this post hints that conditional probability allows us to introduce a time element into our calculations. Indeed, given a sequence of events, we can take measurements at each event to get a <em>random process</em> \(\big\{X_1, X_2, … , X_t, …\big\}\). We can then measure the value of the random variable \(X_t\) at time \(t\) by considering what has already occurred in the past. We are therefore interested in the conditional probability</p>
<p>\[
\mathbb{P}(X_t | \ X_1, X_2, …, X_{t-1}).
\]</p>
<p>Considering how many events in the past are relevant to the present time \(t\) can help simplify our computations of the process as it evolves in time. If \(X_t\) only depends on its last value at time \(t-1\), then the random process, \(\big\{X_t\big\}_{t \ > \ 0 \ }\), is known as a <strong>Markov Chain</strong>. We write this as</p>
<p>\[
\mathbb{P}(X_t | \ X_1, X_2, …, X_{t-1}) = \mathbb{P}(X_t | \ X_{t-1}).
\]</p>
<p>Usually the time element is taken to be discrete.</p>
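<p>To make the Markov property concrete, here is a minimal sketch of a two-state Markov chain. The states and transition probabilities below are illustrative assumptions, not taken from the post:</p>

```python
import random

# Illustrative two-state chain: the distribution of X_t depends only on X_{t-1}.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def step(state, rng):
    """Sample X_t given only the previous state X_{t-1} (the Markov property)."""
    r = rng.random()
    cumulative = 0.0
    for next_state, p in TRANSITIONS[state].items():
        cumulative += p
        if r < cumulative:
            return next_state
    return next_state  # guard against floating-point round-off

def simulate(n_steps, start="sunny", seed=0):
    rng = random.Random(seed)
    chain = [start]
    for _ in range(n_steps):
        chain.append(step(chain[-1], rng))
    return chain

chain = simulate(10)
```

<p>Because each call to <code>step</code> looks only at the current state, the full history \(X_1, …, X_{t-1}\) is never needed — which is exactly the simplification that the Markov property buys us.</p>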
<hr />
<p><em>Photo by Lukas Blazek on Unsplash</em></p>Jonty SinaiThis is the second in a series of blogposts which I am writing about probability. In this post I introduce the fundamental concept of conditional probability, which allows us to include additional information into our probability calculations. The ideas behind conditional probability lead naturally to the most important idea in probability theory, known as Bayes Theorem. As with the first post, which you can read here, the approach is to explain conditional probability using mathematical ideas from measure theory. Again, as with the first post, this post is meant to be accesible to anyone, regardless of whether you’ve studied maths or not.Probability Part 1: Probability for Everyone, a.s.2017-11-23T00:00:00+00:002017-11-23T00:00:00+00:00https://jontysinai.github.io/jekyll/update/2017/11/23/probability-for-everyone<p><em>Inspired by a course which I am taking in probability theory, this blogpost is an attempt to explain the fundamentals of the mathematical theory of probability at an intuitive level. As the title suggests, this post is pretty much intended for everyone, regardless of mathematical level or ability. There will be some mathematics, but feel free to skip through these sections. The important stuff comes in between the mathematics.</em></p>
<h1 id="chance-and-nature">Chance and Nature</h1>
<p><br /></p>
<p>If you’re reading this post, then at some point you’ve encountered chance. Chance is intertwined with nature, just as the planets revolve around the sun or as our genetic code unwinds and transforms from one generation to the next. But unlike the laws of motion of the universe, chance is uncertain. Every time you’ve played a game involving the roll of dice, or forgot to check the weather before stepping outside, or clicked next on your music streaming service, you’ve taken a bet against chance. All without knowing with certainty what hand the universe will play.</p>
<p>Chance, as we know it, is the possibility that something will happen - say that you roll a 6, or that the sun will be shining on your way home*. Probability is the measure of the <em>likelihood</em> that something will happen - i.e. the probability that you roll a 6 is 1 in 6 (assuming of course that the die is fair).</p>
<p>In everyday terms, probability is our best guess at trying to predict what nature, and let’s face it, what humanity has in store. For example, you might be wondering what the probability is that today you will have three glorious looking donuts, coated in sprinkles (like the ones above).</p>
<blockquote>
<p>Probability theory is our attempt to measure uncertainty in the universe.</p>
</blockquote>
<p>* <small> This is a perfect example of how the universe plays its hand. When I started this post this morning, the sun was shining as bright as can be. The whole day was like this, except on my way home when lo and behold a runaway cloud let loose. Unexpected! <small></small></small></p>
<h1 id="measuring-the-universe">Measuring the Universe</h1>
<p><br /></p>
<p>Mathematical studies of chance and probability date back to 16th and 17th century Italy and France. During this time, the Renaissance mathematicians Gerolamo Cardano, Pierre de Fermat and Blaise Pascal were mostly concerned with calculating their odds of winning games of dice. However, the modern theory of probability as we know it was advanced during the 20th century following the development of <em>measure theory</em>.</p>
<p>The goal of measure theory was to construct a systematic method with which to compare the relative sizes of sets in mathematics. Since the probability of an event occurring is precisely a <em>measure of the likelihood</em> of that event occurring, measure theory became a natural framework for studying probability.</p>
<p>The problem with measure theory (for our purposes) is that it becomes gloriously abstract very quickly. Nonetheless, I will try and introduce the mathematical framework under which probability is studied, without going too deep into the measure theory. Let’s see how far we can get.</p>
<h3 id="a-motivating-example">A Motivating Example</h3>
<p><br /></p>
<p>Consider a game of throwing two fair dice. We would like to measure the likelihood that any roll of the dice will add up to a given number. For example, if we roll a \(3\) and \(6\), we get \(9\). Mathematically we can represent this as a function, \(X\), taking the pair \((3,6)\) to \(9\):</p>
<p>\[
X\big((3,6)\big) = 9
\]</p>
<p>We would like to consider the set of all such possible combinations of dice throws. We call this set \(\Omega\) (capital <em>omega</em>), which we can enumerate compactly as</p>
<p>\[
\Omega = \big\{(i,j): 1\leq i,j \leq6\big\}
\]</p>
<blockquote>
<p>The pair \((3,6)\) is then a member of \(\ \Omega\), which we write: \((3,6) \in \Omega\).</p>
</blockquote>
<p>However, if we consider the event that a given throw of dice sums to \(9\), then there is more than one possible combination, i.e. the pairs:</p>
<p>\[
A = \big\{(3,6), (4,5), (5,4), (6,3) \big\}
\]</p>
<p>Evaluating \(X\) (sampling) on any of the pairs in the above set will give us \(9\). Notice that the set \(A\) is not an element (member) of \(\Omega\), but it clearly contains elements of \(\Omega\)! So we need something based on \(\Omega\), only more expansive.</p>
<p>This motivates the concept of a <em>family of events*</em>, \(\mathscr{F}\), which contains all possible combinations of elements from \(\Omega\), including the whole of \(\Omega\) itself, and the empty set, denoted \(\emptyset\). In essence, \(\mathscr{F}\) is a set of sets, so that for consistency, a single roll takes the form: \(\{(3,6)\}\).</p>
<blockquote>
<p>The set \(A\) is then an element of \(\mathscr{F}\), i.e. \(A \in \mathscr{F}\).</p>
</blockquote>
<p>Now that we have a set of events, we can proceed to our goal of measuring the likelihood of these events. The mathematically precise manner in which we do this is fairly involved, but intuitively (and perhaps even realistically), we can think of the probability of an event as the likelihood of that event <em>relative</em> to the overall possibility of something happening. To this end, if \(F \in \mathscr{F}\) is <em>any event</em>, then one way of calculating its probability is:</p>
<p>\[
\mathbb{P}(F) = \frac{size(F)}{size(\Omega)}
\]</p>
<p>The ambiguity of course lies in what we mean precisely by the <em>size</em> of \(F\) or \(\Omega\), but this is a debate for another day! In our simple example, this is fairly easy:</p>
<ul>
<li>In the case of <em>one</em> die, there are 6 outcomes of a roll, let’s call them \(\Omega_1 = \{1,2,3,4,5,6\}\). There is only one way of rolling a \(1\), so
\[
\mathbb{P}(1) = \frac{size\big(\{1\}\big)}{size(\Omega_1)} = \frac{1}{6}
\]
On \(\Omega_1\), an example of an element of the <em>family of events</em> is \(\{1,2\}\), which we can interpret as rolling \(1\) <em>or</em> \(\ 2\), which has probability
\[
\mathbb{P}\big(\{1,2\}\big) = \frac{size\big(\{1,2\}\big)}{size(\Omega_1)} = \frac{2}{6} = \frac{1}{3}
\]</li>
<li>In the case of <em>two</em> dice, there are \(6^{2} = 36\) possible outcomes, with \(\Omega\) defined compactly as above. Then
\[
\mathbb{P}\big((1,2)\big) = \frac{size\big(\{(1,2)\}\big)}{size(\Omega)} = \frac{1}{36}
\]
and taking \(A\) as above:
\[
\mathbb{P}(A) = \frac{size(A)}{size(\Omega)} = \frac{4}{36} = \frac{1}{9}
\]</li>
</ul>
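<p>The calculations above can be reproduced with the relative-size measure written as a small function (an illustrative sketch of the formula \(\mathbb{P}(F) = size(F)/size(\Omega)\)):</p>

```python
from fractions import Fraction
from itertools import product

def prob(event, omega):
    """P(F) = size(F) / size(Omega): probability as relative size."""
    return Fraction(len(event), len(omega))

omega_1 = {1, 2, 3, 4, 5, 6}              # one die: 6 outcomes
omega_2 = set(product(omega_1, repeat=2))  # two dice: 36 ordered pairs

print(prob({1}, omega_1))        # 1/6
print(prob({1, 2}, omega_1))     # 1/3
print(prob({(3, 6), (4, 5), (5, 4), (6, 3)}, omega_2))  # 1/9
```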
<p>Equipped with a set of outcomes, \(\Omega\), a family of events, \(\mathscr{F}\), and a measure of probability, \(\mathbb{P}\), we have almost everything we need to lay out the fundamental ideas of probability theory. The missing ingredient is the function we defined at the beginning of this section: \(X\). This function is a tricky object and handling it requires some care. In other words a bit more math, which we will cover shortly. For now, I will claim that we can use \(X\) to <em>encode the randomness</em> of the example game.</p>
<p>* <small> Technically \(\mathscr{F}\) is what we call a <em>\(\sigma\)-algebra</em> on \(\Omega\).
<small></small></small></p>
<h2 id="the-setting">The Setting</h2>
<p><br /></p>
<p>The motivating example has given us enough material to set the scene more generally.</p>
<p>The triple \(\big(\Omega, \mathscr{F}, \mathbb{P}\big)\) is known as a <strong>probability space</strong> if \(\ \mathbb{P}(\Omega) = 1\), in which case the map which sends elements of \(\mathscr{F}\) to numbers in the interval \([0,1]\),</p>
<p>\[
\mathbb{P}:\mathscr{F} \to [0,1]
\]</p>
<p>is known as a <em>probability measure</em>.</p>
<blockquote>
<p>The requirement that \(\ \mathbb{P}(\Omega) = 1\) reflects the logic that in the world of \(\big(\Omega, \mathscr{F}, \mathbb{P}\big)\) something has got to happen from \(\Omega\).</p>
</blockquote>
<p>The elements \(A \in \mathscr{F}\) are known as <strong>events</strong>, while the elements \(\omega\) (lowercase <em>omega</em>) \(\in \Omega\) (equivalently \(\{\omega\} \in \mathscr{F}\)) are called <strong>elementary events</strong> (or <strong>realisations</strong>).</p>
<blockquote>
<p>Events in \(\mathscr{F}\) are made up of elementary events (or the lack thereof, e.g. the event that you <strong>don’t</strong> roll two 6’s).</p>
</blockquote>
<p>The next step is to define the notion of a <em>random variable</em>. Our intuition says that this must be some kind of variable that takes its values at random. Mathematically, this isn’t as obvious to define. The formal definition will follow our intuition in the following way:</p>
<blockquote>
<p>When we measure the probability of an event, we start by measuring how likely we are to <strong>observe</strong> an event.</p>
</blockquote>
<h3 id="the-preimage">The Preimage</h3>
<p><br /></p>
<p>In the case of the dice example, we can observe the sum of two dice being \(9\) in four ways, listed by the elements of the set \(A\). We will do the same thing in general by starting with a set of outcomes and measuring the size of the set of possibilities that could lead to that outcome.</p>
<p>We formalise this as follows: let \(X:\Omega \to \mathbb{R}\) be a function which sends the elementary events from \(\Omega\) to the <em>real line</em> (i.e. all the usual numbers that we are familiar with). Then the function \(X\) will be a <strong>random variable</strong> if the <em>preimage</em> of any subset \(B \subset \mathbb{R}\) (the symbol \(\subset\) is used to indicate that \(B\) is a subset of \(\mathbb{R}\)) is contained in \(\mathscr{F}\).</p>
<blockquote>
<p>A function’s <strong>preimage</strong> of a given set, \(B\), is the set of elements which it sends to \(B\). For \(X:\Omega \to \mathbb{R}\), this is defined as:
\[
X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}
\]
where \(B \subset \mathbb{R}\).</p>
</blockquote>
<hr />
<p>For instance, let’s return to the function \(X\) from our working example of a game of two dice. The preimage of \(9\) (taking \(\{9\}\) as a subset of \(\mathbb{R}\)) is</p>
<p>\[\begin{align}
X^{-1}\big(\{9\}\big) & = \big\{(i,j) \in \Omega : X\big((i,j)\big) = 9\big\} \\
& = \big\{(i,j) \in \Omega : i + j = 9\big\} \\
& = \big\{(3,6), (4,5), (5,4), (6,3) \big\} \\
& = A
\end{align}\]</p>
<p>We know that \(A \in \mathscr{F}\). So far, so good. In fact, since we include the empty set, \(\emptyset\), as part of \(\mathscr{F}\), it turns out that the preimage of <em>any</em> subset of \(\mathbb{R}\) is in \(\mathscr{F}\). So \(X\) satisfies the condition* for it to be a random variable.</p>
<p>* <small>In reality this condition is a technical requirement which we don’t have to worry about in most (computational) settings. Rather we will be more concerned if the setting we end up using makes sense.<small></small></small></p>
<h2 id="random-variables-and-randomness">Random Variables and Randomness</h2>
<p><br /></p>
<p>Relating the maths back to our intuition, how do we interpret the formal condition that \(X:\Omega \to \mathbb{R}\) is a <strong>random variable</strong> if \(X^{-1}(B) \in \mathscr{F}\) for any subset \(B \subset \mathbb{R}\) of real numbers? We can think of this in the following way:</p>
<blockquote>
<p>If we make some observation, \(B\), which is <em>quantifiable</em> in \(\mathbb{R}\), then \(X^{-1}(B)\) is the set of possible events through which the <strong>random variable \(X\) leads to \(B\)</strong>. In other words, the relative <strong>size</strong> of \(X^{-1}(B)\) is the <strong>chance</strong> of \(B\) happening, as governed by the <strong>random generating mechanism</strong> of \(X\).</p>
</blockquote>
<p>In fact, this is precisely what we do when we <em>measure</em> the likelihood of \(B\) using the probability measure:</p>
<p>\[
\mathbb{P}\big(X^{-1}(B)\big) = \mathbb{P}\big(\{\omega \in \Omega : X(\omega) \in B\}\big)
\]</p>
<p>Motivated by this equivalence, as well as the desire for intuitive coherence (usually we don’t think of events in terms of preimages!), we can simply write</p>
<p>\[
\mathbb{P}\big(X \subset B\big) \ \ \text{in place of} \ \ \mathbb{P}\big(X^{-1}(B)\big)
\]</p>
<blockquote>
<p>We can read this representation as the <strong>probability that the random variable \(X\) lands in \(B\)</strong>.</p>
</blockquote>
<p>However, the set containment symbol looks a little unwieldy and is just there for technical completeness. We can make things easier to read by writing</p>
<p>\[
\mathbb{P}\big(X = B\big) \ \ \text{in place of} \ \ \mathbb{P}\big(X \subset B\big)
\]</p>
<blockquote>
<p>We read this as the <strong>probability that \(X\) equals \(B\)</strong>, which simply means the <em>probability that the random variable \(X\) takes on the value \(B\)</em>.</p>
</blockquote>
<p>Returning to our example, if we take \(B = \{9\}\), and then simply writing \(9\) instead of \(\{9\}\), we can recover the usual notation for the probability of rolling two numbers in a game of dice which sum to \(9\):</p>
<p>\[
\mathbb{P}\big(X = 9\big) = \frac{1}{9}
\]</p>
<p>Finally we can take a more direct approach to measuring probability by beginning with the event of interest, say some \(A \in \mathscr{F}\) which generates some observation. To measure its likelihood we start by evaluating the random variable on this event:</p>
<p>\[
B = X(A) = \{X(\omega): \omega \in A\}
\]</p>
<p>Then we calculate:</p>
<p>\[
\mathbb{P}(A) = \mathbb{P}\big(X(A) = B\big) = \mathbb{P}(X = B) = \mathbb{P}(B)
\]</p>
<p>choosing whichever notation is more clear, depending on the context.</p>
<h3 id="bonus">Bonus</h3>
<p><br /></p>
<p>Given any set \(\mathscr{U}\), we say that \(\mathscr{U}\) happens <em>almost surely</em> (a.s.) if \(\mathbb{P}\big(\mathscr{U}\big) = 1 \). This explains the title.</p>
<hr />
<p><em>Photo by Patrick Fore on Unsplash</em></p>Jonty SinaiInspired by a course which I am taking in probability theory, this blogpost is an attempt to explain the fundamentals of the mathematical theory of probability at an intuitive level. As the title suggests, this post is pretty much intended for everyone, regardless of mathematical level or ability. There will be some mathematics, but feel free to skip through these sections. The important stuff comes in between the mathematics.The Perceptron2017-11-11T00:00:00+00:002017-11-11T00:00:00+00:00https://jontysinai.github.io/jekyll/update/2017/11/11/the-perceptron<p><em>This is the second post in a series dedicated to the history of Artificial Neural Networks. Read the first post on the MCP Neuron <a href="https://jontysinai.github.io/jekyll/update/2017/09/24/the-mcp-neuron.html">here</a>. For an accompanying Jupyter notebook with a Python implementation of the Perceptron, go <a href="https://github.com/JontySinai/PythonAI/blob/master/Notebooks/Sect1-2_Perceptron.ipynb">here</a></em></p>
<p>In my last post, I introduced the MCP Neuron, the first biologically inspired algorithm. The MCP Neuron was a significant first step in artificial intelligence as it could model the AND, OR and NOT logic gates using an algorithm inspired directly by the neurons found in the brain. This post is about the Perceptron, a natural evolution of the MCP Neuron, which incorporated an early version of a <em>learning algorithm</em>.</p>
<p>As a reminder, neurons in the brain are connected to each other to form a neural network. In this network, a single biological neuron can receive signals from other neurons. If the combined <em>intensity</em> of these signals is sufficient, the neuron will fire off another signal to other neurons in the network.</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2017-11-11-the-perceptron/bio-vs-MCP.png" /></p>
<p>The MCP Neuron modelled this dynamic by summing a vector of <em>input signals</em>, \([x_1, x_2, … , x_m]\) with values \(1\) or \(0\), together with a vector of corresponding <em>weights</em>, \([w_1, w_2, … , w_m]\) with values \(1\),\(-1\) or \(0\). If the weighted sum of the signals exceeded some <em>threshold value</em>, \(t\), the model outputs a positive signal, otherwise it outputs a null signal. I.e.</p>
<blockquote>
<p>\[
y = \begin{cases}
1, & \text{if} \ \sum_{i=1}^{m}w_{i}x_{i} \ \geq \ t, \\
0, & \text{otherwise}
\end{cases}
\]</p>
</blockquote>
<p>As an example, to model the AND gate for two input signals, we set the weights to \([1,1]\) and the threshold value to \(2\) so that all input signals must be \(1\) for the neuron to fire a positive output.</p>
<p>However, the problem with the MCP Neuron is that the weights had to be <em>pre-programmed</em> for each logic gate. Furthermore, its limited use of integers restricted the model to basic logic gates. This problem would be addressed in 1957, when the psychologist Frank Rosenblatt extended the MCP Neuron to include a learning algorithm which could automatically figure out the correct weights. He called this model the Perceptron.</p>
<h2 id="learning-algorithms-a-brief-primer">Learning Algorithms: A Brief Primer</h2>
<p><br /></p>
<p>The Perceptron is an example of a <em>supervised learning algorithm</em>. But before we look at the Perceptron, what is a learning algorithm, and what does it mean for a learning algorithm to be supervised?</p>
<blockquote>
<p>A <strong>learning algorithm</strong> is, roughly speaking, a method which adapts its computation units (for example weights in a sum) in order to achieve a desired behaviour. This is known as <strong>training</strong>.</p>
<p>Typically, learning algorithms are presented with examples of input data, with their correct output data, during training. This is known as <strong>supervised learning</strong>.</p>
</blockquote>
<p>The class of learning algorithms that the Perceptron belongs to is known as a <strong>binary classifier</strong>. Binary classifiers learn to group data into one of two classes, typically referred to as the <strong>positive class</strong> and the <strong>negative class</strong>. In the supervised learning setting, input data used during training is already <em>labelled</em> as positive (\(1\)), or negative (\(-1\)). During training, binary classifiers learn a <strong>decision boundary</strong> which separates the data into the two classes.</p>
<p>Below is an example of a 1D decision boundary (the simplest case) on the left, and a 2D decision boundary on the right.</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2017-11-11-the-perceptron/decision-boundaries.png" /></p>
<p>Linear decision boundaries like the examples above can be parametrised by a vector of weights, just like those in the MCP Neuron. But how does a learning algorithm figure out the correct vector of weights to fit the decision boundary?</p>
<p>The basic idea is illustrated below:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2017-11-11-the-perceptron/learning.png" /></p>
<blockquote>
<ol>
<li>Initialise the weights either to 0 or small random numbers.</li>
<li>For each training example, \(x_n\), compute the predicted class, \(\hat{y_n}\), using the weighted sum:
\[
\hat{y_n} = \begin{cases}
1, & \ \mathbf{w}^{T}\mathbf{x}_{n} \ \geq \ 0, \\
-1, & \ \text{otherwise}
\end{cases}
\]</li>
<li>If a training example is misclassified, adjust the weights a little.</li>
<li>Repeat 2 - 3, until the number of misclassifications is reduced to a minimum.</li>
</ol>
</blockquote>
<p><em>Note the notation from linear algebra: \(\mathbf{w} = [w_1, w_2, … , w_m]\), \(\mathbf{x} = [x_1, x_2, … , x_m]\), and \(\mathbf{w}^{T}\mathbf{x} = \sum_{i=1}^{m}w_{i}x_{i}\).</em></p>
<p>We will now make this more concrete by looking at the Perceptron learning algorithm.</p>
<h2 id="the-perceptron">The Perceptron</h2>
<p><br /></p>
<p>The Perceptron algorithm works by comparing the true label, \(y_n\), with the predicted label, \(\hat{y_n}\), for <strong>every</strong> training example, and then updating the weights according to whether they are too small or too large.</p>
<p>How do we measure this comparison? Since the labels are either \(1\) or \(-1\), one intuitive way is to note that the difference between the true label and the predicted label is \(0\) when the label is predicted correctly, and \(\pm2\) when the label is predicted incorrectly:</p>
<blockquote>
<p>\[
y_{n} - \hat{y_n} = \begin{cases}
0 &, \ y_n \ = \ \hat{y_n}, \\
2 &, \ y_n \ = 1, \ \hat{y_n} = -1, \\
-2 &, \ y_n \ = -1, \ \hat{y_n} = 1
\end{cases}
\]</p>
</blockquote>
<p>Since the value for \(\hat{y_n}\), depends on how large (positive) or small (negative) the weights vector is, we need to reduce \(\mathbf{w}\) when it is too large and increase it when it is too small. Using the cases above, this motivates the update rule:</p>
<blockquote>
<p>\[
\mathbf{w} = \mathbf{w} + \alpha(y_n - \hat{y_n})\mathbf{x_n}
\]</p>
</blockquote>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2017-11-11-the-perceptron/Perceptron.png" /></p>
<p>There are three important factors in this update rule:</p>
<ul>
<li>The parameter, \(\alpha > 0\), is called the <em>learning rate</em>, which as the name suggests can be used to tune how quickly or slowly \(\mathbf{w}\) is updated.</li>
<li>The difference \((y_n - \hat{y_n})\) scales the update by \(2\) when we misclassify a positive example (increasing the weights), and by \(-2\) when we misclassify a negative example (decreasing the weights).</li>
<li>Finally, notice that the weight update is proportional to the value of \(\mathbf{x_n}\), which ensures that we move the decision boundary to a lesser degree when \(\mathbf{w}^{T}\mathbf{x}\) is closer to zero.</li>
</ul>
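<p>The full learning loop can be sketched in a few lines of Python. This is a minimal illustration of the update rule (plain lists, no bias term, a fixed number of passes over the data, and a made-up toy dataset), not the notebook's implementation:</p>

```python
def perceptron_fit(X, y, alpha=0.2, epochs=10):
    """Learn weights for a linear decision boundary through the
    origin, using the Perceptron update rule."""
    w = [0.0] * len(X[0])  # step 1: initialise the weights
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            # step 2: predict with the weighted sum
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x_n)) >= 0 else -1
            # step 3: if misclassified, nudge the weights;
            # when y_n == y_hat the update term is zero, so nothing changes
            w = [wi + alpha * (y_n - y_hat) * xi for wi, xi in zip(w, x_n)]
    return w

# A linearly separable toy set: positive class where x1 + x2 > 0.
X_train = [[2, 1], [1, 2], [-1, -2], [-2, -1]]
y_train = [1, 1, -1, -1]
w = perceptron_fit(X_train, y_train)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x_n)) >= 0 else -1
               for x_n in X_train]
print(predictions)  # [1, 1, -1, -1]
```

<p>On this toy set the loop settles after a couple of passes, since every correctly classified point leaves the weights untouched.</p>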
<hr />
<p>For a visual explanation of why the update rule works, consider the simple 2D case given below, with initial weights vector \(\mathbf{w} = [1, 1]\) and learning rate \(\alpha = 0.2\), and only one misclassified example.</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2017-11-11-the-perceptron/misclassification.png" /></p>
<p>In this scenario, the training example, \(\mathbf{x^* } = [-1, 2]\), \(y^* = -1\), lies on the wrong side of the decision boundary
, and indeed, \(\mathbf{w}^{T}\mathbf{x}^* = 1 > 0\), so \(\hat{y^*} = 1\). Then the update is:</p>
<p>\[\begin{align}
\mathbf{w} & = \mathbf{w} + \alpha\big(y^* - \hat{y^* }\big)\mathbf{x}^* \\
& = \begin{bmatrix}1 \\ 1\end{bmatrix} + 0.2\big(-2\big)\begin{bmatrix}-1 \\ 2\end{bmatrix} \\
& = \begin{bmatrix}1.4 \\ 0.2\end{bmatrix}
\end{align}\]</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2017-11-11-the-perceptron/alpha-just-right.png" /></p>
<p>With the new decision boundary, \(\mathbf{w}^{T}\mathbf{x}^* = -1 < 0\), so \(\hat{y^*} = -1\), as required.</p>
<h2 id="the-learning-rate">The Learning Rate</h2>
<p><br /></p>
<p>In the example above, the learning rate \(\alpha = 0.2\) was chosen so that the Perceptron correctly classified \(\mathbf{x^* }\) after just one update. To get an understanding of how this parameter affects convergence of the learning algorithm, we can look at what happens when we choose other values for the learning rate:</p>
<ul>
<li>If we set \(\alpha = 0.05\), then the updated weights vector would be \([1.1, 0.8]\). The decision boundary is only rotated by a small amount, and thus the point \(\mathbf{x^* } = [-1,2]\) is still misclassified.</li>
</ul>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2017-11-11-the-perceptron/alpha-too-small.png" /></p>
<ul>
<li>If we set \(\alpha = 0.5\), the updated weights vector is now \([2,-1]\). The decision boundary has rotated by such a large amount that although it correctly classifies \(\mathbf{x^* }\), it incorrectly classifies other points.</li>
</ul>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2017-11-11-the-perceptron/alpha-too-large.png" /></p>
<p>In summary, if the learning rate is too small, the learning algorithm may fail to converge in a reasonable amount of time. If the learning rate is too large, the learning algorithm may fail to settle on a good decision boundary
at all! Thus we can see that the learning rate, \(\alpha\), is an important parameter in the learning algorithm, and is vital to the success of the Perceptron.</p>
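<p>As a quick sanity check, we can apply the update rule once for each of the three learning rates discussed above. This is a minimal sketch reproducing the worked example, not code from the notebook:</p>

```python
# Worked example: initial weights w = [1, 1], and the misclassified
# point x* = [-1, 2] with true label y* = -1 (predicted label +1).
w0 = [1.0, 1.0]
x_star, y_true, y_hat = [-1.0, 2.0], -1, 1

results = {}
for alpha in (0.05, 0.2, 0.5):
    # One application of the Perceptron update rule.
    w = [wi + alpha * (y_true - y_hat) * xi for wi, xi in zip(w0, x_star)]
    results[alpha] = w
    # Re-classify x* with the updated weights.
    score = sum(wi * xi for wi, xi in zip(w, x_star))
    label = 1 if score >= 0 else -1
    print(alpha, [round(wi, 2) for wi in w],
          "correct" if label == y_true else "still misclassified")
```

<p>Only \(\alpha = 0.05\) leaves \(\mathbf{x^* }\) misclassified; \(\alpha = 0.5\) does classify it correctly here, but as the figure above shows, the over-rotated boundary now misclassifies other points.</p>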
<h2 id="conclusion">Conclusion</h2>
<p><br /></p>
<p>Introduced in 1957, the Perceptron introduced an important innovation to the field of artificial intelligence: learning algorithms. The basic premise of the Perceptron’s learning algorithm is as follows:</p>
<ul>
<li>If a point is classified correctly, do nothing.</li>
<li>If a point is misclassified, adjust the Perceptron’s decision boundary until the point is classified correctly.</li>
<li>Do this for all points, until settling on a decision boundary which minimises the number of misclassified points, possibly zero of them.</li>
</ul>
<p>However, the Perceptron suffered from two things:</p>
<ul>
<li>It could only fit simple, linear decision boundaries.</li>
<li>Its learning capabilities were notoriously oversold by Rosenblatt, with the New York Times reporting that it was</li>
</ul>
<blockquote>
<p>“the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”</p>
</blockquote>
<p>The first weakness (at least) meant that it wasn’t long before the Perceptron was shown to be incapable of recognising simple non-linear patterns. Ultimately the Perceptron met its demise at the hands of the revered computer scientists Marvin Minsky and Seymour Papert in their infamous book, titled <em>Perceptrons</em> and released in 1969. It is believed that this book’s criticism of perceptrons, and of their extension to neural networks (explored in the next post), contributed to the so-called <em>AI Winter</em> - a period of reduced funding and activity in artificial intelligence research that spanned the 1970s, 1980s and 1990s.</p>
<p>However, as we shall explore in the next post, artificial intelligence was far from doomed, as during this period a small group of “rebel” scientists continued to work on a new type of classifier: the <em>artificial neural network</em>.</p>
<hr />
<p><em>Special thanks go to <a href="https://sebastianraschka.com">Sebastian Raschka</a> and <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/">Andrey Kurenkov</a> for the inspiration, and to <a href="https://www.coursera.org/specializations/deep-learning">Andrew Ng</a> for his passion and dedication to the field of Deep Learning. The style of this blogpost is intended to be conversational and informal. For a more formal treatment of the mathematics and code, checkout the Jupyter notebook version on Github <a href="https://github.com/JontySinai/PythonAI/blob/master/Notebooks/Sect1-2_Perceptron.ipynb">here</a>.</em></p>
<p><em>Photo by Len dela Cruz on Unsplash</em></p>Jonty SinaiThis is the second post in a series dedicated to the history of Artificial Neural Networks. Read the first post on the MCP Neuron here. For an accompanying Jupyter notebook with a Python implementation of the Perceptron, go hereThe MCP Neuron2017-09-24T00:00:00+00:002017-09-24T00:00:00+00:00https://jontysinai.github.io/jekyll/update/2017/09/24/the-mcp-neuron<p>Artificial intelligence is an incredibly exciting area of research and development which spans mathematics, statistics, computer science, engineering, philosophy, linguistics, information theory, biology, psychology, neuroscience and others. It is also a fairly nascent area of science - what has been achieved so far are just the first steps in the journey to achieving <em>general artificial intelligence</em>.</p>
<p>The success, however, of <em>deep learning</em> in image recognition, natural language and games, has inspired the world to take note of artificial intelligence. This success has fueled a wave of media hype and attention that has perhaps mistaken what AI <em>is</em> for what it <em>isn’t</em>. AI as we know it today is not capable of thought, has no consciousness and certainly does not have any sort of intelligence that can surpass our own. Rather, the “AI” that we experience in our mobile phones, the internet or read about in the news, is a collection of computational and statistical techniques known as deep learning (or machine learning, depending on the scope).</p>
<p>So what is deep learning? For the best explanation of deep learning and its limits, I recommend this excellent <a href="https://blog.keras.io/the-limitations-of-deep-learning.html">two-part post</a> by Francois Chollet, the creator of Keras. In summary, deep learning is a sequence of geometric transformations (linear and non-linear), that when applied to data, <em>may</em> be able to statistically model the relationships contained in that data. These geometric transformations are organised in a layered network, known as a <em>neural network</em>. This series is about these so-called artificial neural networks, in which I will attempt to uncover what they are, how they work, where they come from and why they are called “neural networks”.</p>
<h2 id="the-mcp-neuron">The MCP Neuron</h2>
<p><br /></p>
<p>Before there were any artificial neural networks, or even the perceptron (more on both in upcoming posts!), there was the MCP Neuron. First proposed in 1943 by the neurophysiologist Warren S. McCulloch and the logician Walter Pitts, the McCulloch-Pitts (MCP) neuron is a simple mathematical model of a <em>biological neuron</em>.</p>
<p>To understand how this model works, let’s begin with a very simplified (and certainly non-expert) explanation of a biological neuron. These are electrically excitable, interconnected nerve cells in the brain which process and transmit information through electrical and chemical signals. These neurons are all connected with each other to form a neural network in the brain. The connections between neurons are known as <em>synapses</em>. Now a single neuron in this simplified explanation consists of three parts:</p>
<ul>
<li>Soma: this is the main part of the neuron which processes signals.</li>
<li>Dendrites: these are branch-like shapes which receive signals from other neurons.</li>
<li>Axon: this is a single nerve which sends signals to other neurons.</li>
</ul>
<p>The picture looks something like this:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2017-09-24-the-mcp-neuron/neuron.png" /></p>
<p>At its most basic, a single biological neuron may receive multiple signals from other neurons via its dendrites. These signals are then combined in the soma, and this combination <em>may</em> fire off another signal from the neuron to other neurons, which is propagated via the axon.</p>
<hr />
<p>The idea behind the MCP neuron is to abstract the biological neuron described above into a simple mathematical model. The neuron receives incoming signals as \(1\)’s and \(0\)’s, takes a weighted sum of these signals and outputs a \(1\) if the weighted sum is at least some threshold value or a \(0\) otherwise. Formally this mathematical model can be specified as follows:</p>
<blockquote>
<ul>
<li>Let \([x_1, x_2, … , x_m]\) be a vector of input signals where each \(x_i\) has a value of \(1\) or \(0\).</li>
<li>Let \([w_1, w_2, … , w_m]\) be a vector of weights corresponding to the input signals where each \(w_i\) has a value of \(1\), \(-1\) or \(0\).
<ul>
<li>Input signals with a weight of \(1\) are called <em>excitatory</em> since they contribute towards a positive output signal in the sum.</li>
<li>Input signals with a weight of \(-1\) are called <em>inhibitory</em> since they repress a positive output signal in the sum.</li>
<li>Input signals with a weight of \(0\) do not contribute at all to the neuron.</li>
</ul>
</li>
<li>Then for some threshold value \(t\), an integer, the output signal is determined by the following <em>activation function</em>:</li>
</ul>
<p>\[
y = \begin{cases}
1, & \text{if} \ \sum_{i=1}^{m}w_{i}x_{i} \ \geq \ t, \\
0, & \text{otherwise}
\end{cases}
\]</p>
<ul>
<li>The neuron is said to be “activated” when the weighted sum is greater than or equal to the threshold value.</li>
</ul>
</blockquote>
<p>The MCP Neuron is illustrated below. Note that the \(x_i\) input signals are analogous to the <em>dendrites</em>, the activation function is analogous to the <em>soma</em> and the output is analogous to the <em>axon</em>.</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2017-09-24-the-mcp-neuron/MCP-neuron.png" /></p>
<p>McCulloch and Pitts’ original experiment was to see if they could use this model to construct different <a href="http://www.ee.surrey.ac.uk/Projects/CAL/digital-logic/gatesfunc/"><em>logic gates</em></a> by simply specifying what the weights and threshold should be. In the next section, we’ll go through the basic logic gates and show how the MCP neuron can model them. For a corresponding Python implementation of these examples in action, checkout the corresponding Jupyter notebook on Github <a href="https://github.com/JontySinai/PythonAI/blob/master/Notebooks/Sec1-1_MCP_Neuron.ipynb">here</a>.</p>
<hr />
<h3 id="the-or-gate">The OR Gate</h3>
<p><br /></p>
<p>The first logic gate that we will go through is the OR gate. The OR gate indicates if there are <strong>any</strong> positive (as opposed to null) signals amongst the inputs. It will output a \(1\) if at least one of the input signals is a \(1\). For two input signals, the OR gate’s truth table looks like this:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2017-09-24-the-mcp-neuron/OR-gate.png" /></p>
<p>To reproduce the OR Gate using an MCP neuron, all of the weights should be \(1\), so that the neuron “considers” all inputs, and the threshold value should be \(1\), so that only one positive signal is required (at minimum) for the neuron to “activate”.</p>
<h3 id="the-and-gate">The AND Gate</h3>
<p><br /></p>
<p>The next logic gate is the AND gate. The AND gate indicates if <strong>all</strong> of the input signals are positive. It will output \(1\) only if all of its input signals are \(1\). Its truth table looks like this:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2017-09-24-the-mcp-neuron/AND-gate.png" /></p>
<p>To reproduce the AND Gate using an MCP neuron, all of the weights should be \(1\), again so that the neuron considers all inputs, but the threshold value should be equal to the number of inputs (eg. \(2\) for the example above), so that the neuron is activated only when all inputs are positive.</p>
<h3 id="the-not-gate">The NOT Gate</h3>
<p><br /></p>
<p>So far we have seen logic gates which consider all inputs - i.e. their MCP neuron weights were all \(1\). What about gates which ignore their inputs? This can be done using a NOT gate, which inverts the signal of its input, so that if the input is positive then the output will be null and vice-versa. In short, it <strong>negates</strong> its input signal. Its truth table is shown below:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto" width="700px" src="/assets/article_images/2017-09-24-the-mcp-neuron/NOT-gate.png" /></p>
<p>To specify a NOT Gate using an MCP neuron, set the input weights to \(-1\) and the threshold value to 0, so that the output is only ever positive when the input signal is null.</p>
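<p>The three gates above can be reproduced with a single weighted-sum function. This is a compact sketch along these lines (the function and gate names here are illustrative, not the notebook's code):</p>

```python
def mcp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts neuron: fire (1) iff the weighted sum of the
    input signals reaches the threshold value, else output 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# The gates, specified purely by weights and a threshold.
OR_gate  = lambda x1, x2: mcp_neuron([x1, x2], [1, 1], threshold=1)
AND_gate = lambda x1, x2: mcp_neuron([x1, x2], [1, 1], threshold=2)
NOT_gate = lambda x: mcp_neuron([x], [-1], threshold=0)

# Truth tables over all input combinations.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([OR_gate(a, b) for a, b in pairs])   # [0, 1, 1, 1]
print([AND_gate(a, b) for a, b in pairs])  # [0, 0, 0, 1]
print([NOT_gate(a) for a in (0, 1)])       # [1, 0]
```

<p>Note that the “learning” here is entirely manual: we picked the weights and thresholds ourselves, which is exactly the limitation discussed in the conclusion below.</p>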
<h2 id="conclusion">Conclusion</h2>
<p><br /></p>
<p>The MCP Neuron seems almost too simple to represent artificial intelligence of any kind, yet it is - and it isn’t. Formal logic is a fundamental component of intelligence. For any machine to have artificial intelligence, it surely should be able to comprehend logic gates. The idea being that logic gates can be strung together to form logic circuits, capable of executing any kind of instruction. This is indeed what underpins modern computational processors. However, we know that CPU’s aren’t really “intelligent” - they’re just able to process any instruction given to them at lightning speed.</p>
<p>What makes the MCP Neuron different is the fact that it could reproduce logic gates using a <em>biologically inspired algorithm</em>. In the field of artificial intelligence, this was a promising achievement, since it almost surely makes sense that any kind of artificial intelligence should resemble the brain - which is, after all, the seat of human intelligence.</p>
<p>The problem with the MCP Neuron is that every logic gate which it could model (and hence every logic circuit which a collection of neurons could model) had to be pre-programmed, something which is clear in the Jupyter notebook accompanying this post. This stands out as a massive contrast to the brain, which learns from experience. Nonetheless, it would take another 14 years before Frank Rosenblatt’s landmark debut of the <em>Perceptron</em> - the first <em>learning algorithm</em> of its kind. That the perceptron - and hence artificial neural networks - is a direct extension of the MCP Neuron, is what makes the MCP Neuron a cornerstone of artificial intelligence, and thus the beginning of our journey.</p>
<hr />
<p><em>This is the first post in a series dedicated to the history of Artificial Neural Networks. Special thanks go to <a href="https://sebastianraschka.com">Sebastian Raschka</a> and <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/">Andrey Kurenkov</a> for the inspiration, and to <a href="https://www.coursera.org/specializations/deep-learning">Andrew Ng</a> for his passion and dedication to the field of Deep Learning. The style of this blogpost is intended to be conversational and informal. For a more formal treatment of the mathematics and code, checkout the Jupyter notebook version on Github <a href="https://github.com/JontySinai/PythonAI/blob/master/Notebooks/Sec1-1_MCP_Neuron.ipynb">here</a>.</em></p>
<p><em>Photo by Meddy Huduti on Unsplash.</em></p>Jonty SinaiArtificial intelligence is an incredibly exciting area of research and development which spans mathematics, statistics, computer science, engineering, philosophy, linguistics, information theory, biology, psychology, neuroscience and others. It is also a fairly nascent area of science - what has been achieved so far are just the first steps in the journey to achieving general artificial intelligence.