How Machines Learn

The Fundamentals of Machine Learning

The last few years have brought us incredible advancements in machine learning. These diverse systems are built from a few core building blocks, which are modified and combined in complex ways.

These core ideas often get abstracted away when discussing the bigger picture, which leaves us to wonder — “What’s going on inside?”

In this interactive lesson, we will answer that question by thinking small.

We will start with the simplest of models — The Line, and see how it can help us build a Neural Network capable of learning anything. Then, going back to the line, we will learn how to train it using the Gradient Descent algorithm, which we will write from scratch in Python.

Modelling Our World

The power of machine learning lies in its ability to distill down large volumes of data into useful representations we call models.

Think of models like play dough. They have adjustable parameters which let us mold them into various shapes. Training is the process of molding them based on data.

Weights and Biases

The line is one of the simplest models, with just two parameters. The weight $w$ controls the line’s tilt and the bias $b$ shifts it.

By the way, we will label inputs as $x$ and outputs as $y$.

$$y(x) = wx + b$$

[Interactive plot of the line, with sliders for $w$ (initially 1) and $b$ (initially 0), each ranging from -10 to 10.]

We can generalize this to multiple inputs $x_1, x_2, ..., x_n$:

$$y(x_1, x_2, ..., x_n) = w_1x_1 + w_2x_2 + ... + w_nx_n + b$$

Notice that each input $x_i$ gets scaled by its own weight $w_i$.

By adjusting these weights, we can tune how much each input influences the value of $y$.

The bias $b$ is independent of any input; it simply shifts $y$ in the positive or negative direction.
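As a minimal sketch of the formula above, here is the multi-input linear model in Python. The name `linear_model` and the example numbers are made up for illustration.

```python
def linear_model(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias: w_1*x_1 + ... + w_n*x_n + b."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Two inputs: the second input counts twice as much as the first.
print(linear_model([1.0, 3.0], weights=[1.0, 2.0], bias=0.5))  # 1*1 + 2*3 + 0.5 = 7.5
```

Raising a weight makes its input matter more, while changing the bias moves every output by the same amount.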

We can see this in action by plotting a linear model with two inputs, $y(x_1, x_2)$.

[Interactive 3D plot of $y(x_1, x_2)$, with sliders for $w_1$, $w_2$ and $b$ (all initially 0), each ranging from -10 to 10.]

Activation Functions

The linear model is… well, linear. But not everything in our world is described in straight lines.

Here’s a non-linear function known as the sigmoid $\sigma$.

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

It smoothly transitions from $0$ to $1$.
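To get a feel for that transition, here is a small sketch that evaluates the sigmoid at a few points. The sample points are arbitrary, and the function is not hardened against extremely negative inputs, where `math.exp` can overflow.

```python
import math

def sigmoid(x):
    """The sigmoid function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Far to the left the output is close to 0, far to the right it is close to 1,
# and at x = 0 it sits exactly halfway.
for x in (-10, -2, 0, 2, 10):
    print(x, round(sigmoid(x), 4))
```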

Something interesting happens when we pass a linear model into the sigmoid: $\sigma(wx + b)$.

$$y(x) = \sigma(wx + b) = \frac{1}{1 + e^{-(wx + b)}}$$

[Interactive plot of $\sigma(wx + b)$, with sliders for $w$ (initially 1) and $b$ (initially 0), each ranging from -10 to 10.]

We are able to use the $w$ and $b$ parameters of the line to change the sharpness and center of the sigmoid’s transition!

This has transformed our line into something like an on-off switch, where we can control the threshold with $b$ and represent uncertainty with $w$.

Functions like the sigmoid into which we pass our linear model are known as Activation Functions. There are many of them out there, of which the sigmoid is the most classic example.

Like before, we can extend this to multiple inputs:

$$y(x_1, x_2, ..., x_n) = \sigma(w_1x_1 + w_2x_2 + ... + w_nx_n + b)$$

Here’s $y(x_1, x_2) = \sigma(w_1x_1 + w_2x_2 + b)$:

[Interactive 3D plot of $y(x_1, x_2)$, with sliders for $w_1$ (initially 1), $w_2$ (initially 2) and $b$ (initially 0), each ranging from -10 to 10.]

If we view it from the top, we see that the activated and non-activated regions are split by a linear boundary.

The model we’ve just created takes in the values of its inputs, scales them by their weights, sums these results with the bias and then feeds the sum into the sigmoid, which then decides whether to activate or not.
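That description translates almost word for word into code. The sketch below assumes the weights from the plot above ($w_1 = 1$, $w_2 = 2$, $b = 0$); the name `neuron` and the test points are made up for illustration.

```python
import math

def neuron(inputs, weights, bias):
    """Scale each input by its weight, add the bias, then apply the sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

weights, bias = [1.0, 2.0], 0.0

print(neuron([5.0, 5.0], weights, bias))    # deep in the activated region, close to 1
print(neuron([-5.0, -5.0], weights, bias))  # deep in the non-activated region, close to 0
print(neuron([2.0, -1.0], weights, bias))   # on the boundary x1 + 2*x2 = 0, exactly 0.5
```

The boundary is the set of points where the weighted sum is zero, which is why it shows up as a straight line when viewed from the top.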

Structure of an Artificial Neuron

In structure, this is similar to a biological neuron. Dendrites on the neuron’s body receive inputs from other neurons. Based on this, the neuron decides whether it should fire a signal through its axon, to which more neurons may be connected. These connections can grow stronger, analogous to the weights in artificial neurons.

[Figure: a biological neuron. Image credit: Wikipedia.]

Neural Networks

Like biological neurons, we can treat the outputs of artificial neurons as inputs for other neurons.

Here’s a setup where $y$ is a special output neuron (we give it no activation function). Its input is the output of the previous neuron in the chain, and this connection has an adjustable weight.

[Interactive diagram: the input $x$ feeds a sigmoid neuron (input weight wi1 = 1, bias b1 = 0), whose output reaches $y$ through the output weight wo1 = 6.28. Slider shown: wi1, from -10 to 10.]

Tweaking this weight allows us to scale the range of the sigmoid function, so it’s no longer confined between $0$ and $1$!

But we can go further, and add neurons in parallel. Here’s two of them.

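[Interactive diagram: two sigmoid neurons in parallel, with input weights wi1 = -1.5 and wi2 = 1.5, biases b1 = -8 and b2 = -8, and output weights wo1 = -5 and wo2 = 5 feeding $y$. Slider shown: wi1, from -10 to 10.]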

They stack on top of each other. What if we add more neurons?

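[Interactive diagram: four sigmoid neurons in parallel, with input weights wi1 = 0.2, wi2 = -0.5, wi3 = 1.9 and wi4 = -0.8, biases b1 = 1, b2 = -5.7, b3 = 0 and b4 = 7.1, and output weights wo1 = 7.2, wo2 = 9.7, wo3 = -6.3 and wo4 = -6.3 feeding $y$. Slider shown: wi1, from -10 to 10.]

This network can draw quite a variety of functions!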

In fact, if we add more neurons, we’re able to draw more intricate functions.

And here, we’ve come face-to-face with what makes Neural Networks so powerful. As long as we have enough neurons in the middle layer, we can mold our network to the shape of practically any continuous function we like (over a bounded range of inputs), to the level of accuracy we need. They are Universal Function Approximators.
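Here is a rough sketch of such a network in code: one input, a middle layer of sigmoid neurons, and a linear output neuron. It uses the weights and biases from the two-neuron example above, and assumes the output neuron has no bias of its own (the diagram does not show one). The name `network` is made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def network(x, input_weights, biases, output_weights):
    """One input, one middle layer of sigmoid neurons, one linear output neuron."""
    hidden = [sigmoid(wi * x + b) for wi, b in zip(input_weights, biases)]
    # The output neuron has no activation function: it just sums its weighted inputs.
    return sum(wo * h for wo, h in zip(output_weights, hidden))

# Values taken from the two-neuron example.
input_weights = [-1.5, 1.5]
biases = [-8.0, -8.0]
output_weights = [-5.0, 5.0]

for x in (-10, -5, 0, 5, 10):
    print(x, round(network(x, input_weights, biases, output_weights), 3))
```

Each middle neuron contributes one scaled, shifted sigmoid, and the output adds them up. Adding more neurons means adding more entries to the three lists.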

And this all began with the humble line.

The rest of this article focuses on the training process. To keep us focused, we will use the line as our model, but these ideas apply to all the other models we’ve seen.

Creating a model in code

The code block below plots $y(x) = x$. I’d like you to update this function to plot $y(x) = wx + b$ by using the w and b variables already defined. Once done, click on “Run” and you should be able to use the sliders to control the weight and bias of the line.

Plotting a line

b = 0
w = 1

def y(x, w, b):
  return x

[Sliders for b and w, each ranging from -10 to 10.]

Great! We now have a model which can fit to some data.

Fitting a model to data

Our data will exist as $(x, y)$ pairs, where $x$ is the value of some input variable and $y$ is the corresponding output.

This data could represent anything!

  • Maybe you want to predict how many ice-creams you can sell ($y$) based on the current temperature ($x$).
  • Or how much energy your solar panels can generate ($y$) based on the hours of daylight ($x$).
  • Or if you’re Sherlock, how tall a person is ($y$) based on the length of their footprint ($x$).

In all these cases, we can reasonably assume that $x$ and $y$ are roughly linearly related. Knowing this, we can choose to use our linear model to fit to this data.

Data from the natural world will have randomness, and our variables may not be linearly related throughout (maybe ice-cream sales start off linearly, but plateau beyond a certain point). We should be aware of a model’s limitations when we use it.

To keep things simple, we will use some made-up data for our program. But you may edit it as you wish!

Now, can you use the sliders to find the values of $w$ and $b$ that best fit this data?

Fitting the line to data

data = [(1, 0.85), (2, 1.43), (3, 1.92), (4, 2.59), (5, 3.20), (6, 3.82), (7, 4.38), (8, 4.97)]

w = 1
b = 0

def y(x, w, b):
  return w * x + b

[Interactive plot of the data points and the line, with sliders for b and w, each ranging from -10 to 10.]

Great! The values of $w$ and $b$ you have found are now a condensed representation of our data points, so we no longer need to keep the points themselves. We can instead feed their $x$ values into our trained model and get approximately the same $y$ values out of it.

Fitting the line to data

x_values = [1, 2, 3, 4, 5, 6, 7, 8]
data = []

w = 0.6
b = 0.2

def y(x, w, b):
  return w * x + b

for xi in x_values:
  data_point = (xi, y(xi, w, b))
  data.append(data_point)

But now we can do more, and also predict $y$ for values of $x$ that were not in our data!

Fitting the line to data

x_values = [-3, 6.28]
data = []

w = 0.6
b = 0.2

def y(x, w, b):
  return w * x + b

for xi in x_values:
  data_point = (xi, y(xi, w, b))
  data.append(data_point)

This is how a trained model is useful. Not only are we storing a more compact representation of the data through its parameters, but we can also get predictions for inputs it never saw before.

This is also true for neurons and neural networks, which may have more inputs and more parameters to model a variety of shapes.

Our next step will be to figure out how to train the model automatically.

The Loss Function

When you fit the line to the data points in the previous section, you were able to visually see how well they were aligned.

But machines don’t see, they crunch numbers. So we need a way to quantify how well the model matches its data.

For any single data point $(x_i, y_i)$, we can measure the absolute difference between the output $y_i$ and the model’s prediction $y_{w,b}(x_i)$, which is based on the current values of $w$ and $b$.

$$\left| y_{w,b}(x_i) - y_i \right|$$

Then we can sum these up for every data point.

$$L_1(w, b) = \sum_{i=1}^{n} \left| y_{w,b}(x_i) - y_i \right|$$

We have just created a Loss Function. This one is known as $L_1$ loss.
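A short sketch of this sum in Python, using the made-up data from the previous section. The name `l1_loss` is an assumption for illustration; the lesson’s own code blocks only implement the squared version.

```python
data = [(1, 0.85), (2, 1.43), (3, 1.92), (4, 2.59),
        (5, 3.20), (6, 3.82), (7, 4.38), (8, 4.97)]

def y(x, w, b):
    return w * x + b

def l1_loss(w, b):
    """Sum of absolute differences between predictions and observed outputs."""
    return sum(abs(y(xi, w, b) - yi) for xi, yi in data)

print(l1_loss(1, 0))      # the starting line y = x
print(l1_loss(0.6, 0.2))  # a line that fits the data much more closely
```

The better the line matches the data, the smaller this number gets, which is exactly the property we want from a loss.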

As an alternative to the $|modulus|$ function, we can also $square^2$ the difference. This gives us the Loss Function named $L_2$ loss, and it has a couple of advantages:

  1. Unlike the modulus function, the square function is differentiable at all points, and we will need that for the Gradient Descent algorithm.
  2. Since the error gets squared, data points that are way off get penalized a lot more than those which are closer to the line.

$$L_2(w, b) = \frac{1}{2} \sum_{i=1}^{n} ( y_{w,b}(x_i) - y_i)^2$$

The $\frac{1}{2}$ factor will just make things convenient when computing derivatives!

Let’s see our loss function in action!

The Loss Function

data = [(1, 0.85), (2, 1.43), (3, 1.92), (4, 2.59), (5, 3.20), (6, 3.82), (7, 4.38), (8, 4.97)]

w = 1
b = 0

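def loss():
  # Half the sum of squared errors over all data points (the L2 loss)
  total = 0
  for xi, yi in data:
    total += (y(xi, w, b) - yi) ** 2
  return total / 2

def y(x, w, b):
  return w * x + b

[Interactive plot of the data and the line showing the current value of loss(), with sliders for b and w, each ranging from -10 to 10.]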

The Loss Landscape

The Loss Function’s inputs are our model’s parameters, $w$ and $b$. We can create a plane with a $w$-axis and a $b$-axis. Picking any point on this plane gives us a combination $(w,b)$, which represents a unique line.

This plane is our model’s parameter space. We can compute our Loss Function for every point on it, and plot the output on a third axis.

This visualization is known as the Loss Landscape of our model.

The Loss Landscape

data = [(1, 0.85), (2, 1.43), (3, 1.92), (4, 2.59), (5, 3.20), (6, 3.82), (7, 4.38), (8, 4.97)]

w = 1
b = 0

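def loss(w, b):
  # Half the sum of squared errors over all data points (the L2 loss)
  total = 0
  for xi, yi in data:
    total += (y(xi, w, b) - yi) ** 2
  return total / 2

def y(x, w, b):
  return w * x + b

[Interactive 3D plot of the Loss Landscape showing loss(w, b) at the current point, with sliders for b (from -15 to 15) and w (from -10 to 10).]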

After plotting the Loss Landscape, we can easily see that there’s a point where the loss is the lowest.
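The brute-force way to find that point is to sample the landscape on a grid and keep the lowest value. The sketch below does this over the same ranges as the sliders; the function name `lowest_on_grid` and the grid spacing of 0.1 are assumptions for illustration.

```python
data = [(1, 0.85), (2, 1.43), (3, 1.92), (4, 2.59),
        (5, 3.20), (6, 3.82), (7, 4.38), (8, 4.97)]

def y(x, w, b):
    return w * x + b

def loss(w, b):
    return sum((y(xi, w, b) - yi) ** 2 for xi, yi in data) / 2

def lowest_on_grid(step=0.1):
    """Evaluate the loss at every grid point and return (loss, w, b) for the lowest one."""
    best = None
    w_steps = int(round(20 / step))  # w from -10 to 10
    b_steps = int(round(30 / step))  # b from -15 to 15
    for i in range(w_steps + 1):
        w = -10 + i * step
        for j in range(b_steps + 1):
            b = -15 + j * step
            current = loss(w, b)
            if best is None or current < best[0]:
                best = (current, w, b)
    return best

best_loss, best_w, best_b = lowest_on_grid()
print(best_w, best_b, best_loss)
```

This should land close to the values we used earlier ($w = 0.6$, $b = 0.2$), but it needs tens of thousands of loss evaluations for just two parameters, and the count multiplies with every parameter we add.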

Our job in the next steps will be to develop an algorithm which descends along the Loss Landscape to reach the bottom of this valley from any starting position, without computing every point on the Loss Landscape.

For this, we would need a way for our machines to know which way is down…

Gradients: Which way is down?

Derivatives

This section assumes knowledge of Derivatives. We’ll quickly see how they’re useful to us.

Here’s a roller-coaster of a polynomial $f(x)$. If we draw a line tangent to it at some value of $x$, we see that the tangent’s slope changes as $x$ changes.

[Interactive plot of $f(x)$ with its tangent line; a slider moves $x$ from -2.5 to 2.5. At $x = 0$, the slope of the tangent is 1.3000.]

Since this slope depends on $x$, we can create a function that takes in $x$ and returns the slope of the tangent. That function is known as the derivative of $f(x)$.

Derivatives are usually written in the forms $\frac{df}{dx}$ or $f'(x)$, where $f$ is the original function and $x$ is the variable we are sliding.

How is this useful? Well, let’s use the widgets above to play a simple game:

  1. Pick a random value of $x$ to start with.
  2. If the slope is positive, move the slider slightly to the left.
  3. If instead it is negative, move it to the right.
  4. Repeat the last two steps, and you will always end up in a “valley” of the graph.

This works because the sign of the derivative tells us which direction the function is increasing in. By going in the opposite direction (going left when the slope is positive), we always head downhill, as long as each move is small.
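The game can be sketched in a few lines of Python. The polynomial plotted above isn’t given, so the sketch assumes a stand-in, $f(x) = x^4 - 3x^2 + 1.3x$, chosen only because its slope at $x = 0$ is also 1.3. The names `slope` and `play` and the step size are likewise made up.

```python
def slope(x):
    """Derivative of the stand-in polynomial f(x) = x**4 - 3*x**2 + 1.3*x."""
    return 4 * x ** 3 - 6 * x + 1.3

def play(x, step=0.01, rounds=1000):
    """Move left when the slope is positive, right when it is negative."""
    for _ in range(rounds):
        if slope(x) > 0:
            x -= step
        elif slope(x) < 0:
            x += step
    return x

print(play(0.0))  # settles near the left valley
print(play(2.0))  # settles near the right valley
```

Two things are worth noticing. Which valley we end up in depends on where we start, and with a fixed step the point ends up hopping back and forth across the bottom instead of stopping exactly on it.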

That’s exactly what we’re looking to do with our Loss Function too! But we still have one problem: our Loss Function $L_2(w,b)$ has two inputs, not one.

So we will need to do something clever about that.

Partial Derivatives

If we fix one of the inputs to some constant, say $L_2(w, b = 1)$, then we’ve converted our two-variable Loss Function into a single-variable function, because we can only adjust the other variable ($w$).

This would allow us to compute the derivative of this function like any other single-variable function.

Partial Derivatives are based on this idea. Except, the other variables are only treated as constants, and are not actually set to any numeric value.

We will instead set them to specific constants when we are evaluating the slope!

Our Loss Function $L_2(w,b)$ has two variables, and thus, two partial derivatives, $\frac{\partial L_2}{\partial w}$ and $\frac{\partial L_2}{\partial b}$.

Let’s say we are at a point $(w,b) = (2,1)$. Then evaluating $\frac{\partial L_2}{\partial w}$ here will fix $b$ (the variable we are not differentiating) to $b=1$, giving us a single-variable function of $w$. We then get the slope along the $w$-axis at $w=2$.

Similarly, $\frac{\partial L_2}{\partial b}$ will give us the slope along the $b$-axis at the same point $(2,1)$.

When we combine both of these partial derivatives into a single vector, we call that vector the Gradient Vector $\nabla L_2(w,b)$.

$$\nabla L_2(w,b) = \begin{bmatrix} \frac{\partial L_2}{\partial w}\\ \frac{\partial L_2}{\partial b}\\ \end{bmatrix}$$

The Gradient is to our Loss Landscape what the slope was to the polynomial. Imagine standing on the side of a hill. The Gradient points in the direction of steepest ascent, so stepping against it takes you downhill, and its length tells you how steep the ground beneath your feet is.
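Before working out any formulas, we can already estimate both slopes numerically by nudging one parameter at a time while holding the other fixed. This is only a sketch using central finite differences, which is not the method used in the rest of this lesson; the name `numerical_gradient` and the nudge size `h` are assumptions.

```python
data = [(1, 0.85), (2, 1.43), (3, 1.92), (4, 2.59),
        (5, 3.20), (6, 3.82), (7, 4.38), (8, 4.97)]

def loss(w, b):
    return sum(((w * xi + b) - yi) ** 2 for xi, yi in data) / 2

def numerical_gradient(w, b, h=1e-6):
    """Estimate the two partial derivatives by nudging one parameter at a time."""
    dw = (loss(w + h, b) - loss(w - h, b)) / (2 * h)  # b is held constant
    db = (loss(w, b + h) - loss(w, b - h)) / (2 * h)  # w is held constant
    return dw, db

# Slopes along the w-axis and the b-axis at the point (w, b) = (2, 1).
print(numerical_gradient(2, 1))
```

Both numbers come out positive at $(2, 1)$, so the loss rises if we increase either parameter from there, and falls if we decrease them.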

Computing the Partial Derivatives for our Loss Function

Here’s our Loss Function again:

$$L_2(w, b) = \frac{1}{2} \sum_{i=1}^{n} ( y_{w,b}(x_i) - y_i )^2$$

We will need to find the partial derivatives for both $w$ and $b$. Let’s start with $w$.

$$\frac{\partial L_2}{\partial w} = \frac{\partial}{\partial w} \frac{1}{2} \sum_{i=1}^{n} ( y_{w,b}(x_i) - y_i)^2$$

The summation operator $\sum$ can make things messy. Since the derivative of a sum is equal to the sum of the derivatives, let’s put this operator outside the partial differentiation step.

$$\frac{\partial L_2}{\partial w} = \sum_{i=1}^{n} \frac{\partial}{\partial w} \frac{1}{2} ( y_{w,b}(x_i) - y_i)^2$$

Now, we could go ahead and differentiate this directly, but there is an easier way out, which scales well for more complex models.

A worthy observation is that we are differentiating a composite function, i.e. a function within a function.

$$\color{grey} \frac{\partial}{\partial w} \color{blue} \frac{1}{2} ( \color{orange} y_{w,b}(\color{grey}x_i\color{orange}) \color{grey} - y_i \color{blue})^2$$

I’ve written the outer function in $\color{blue}\text{blue}$ and the inner one in $\color{orange}\text{orange}$.

When we need to find derivatives of composite functions, we can use a property known as the Chain Rule to do this in a clean way.

The Chain Rule

Here’s the property. Let’s say we have two functions $h(x)$ and $g(x)$, and we use them to create a composite function $f(x)$:

$$f(x) = h(g(x))$$

The Chain Rule says that we can get the derivative $\frac{df}{dx}$ in the form of this product:

$$\frac{df}{dx} = \frac{dh}{dg}\frac{dg}{dx}$$

Here, $\frac{dg}{dx}$ is simply the regular derivative of $g$. But what is $\frac{dh}{dg}$?

Let’s say $h(x)$ was equal to $x^2$, then $\frac{dh}{dx}$ would be $2x$.

But with $h(g(x))$, the input to $h$ is now $g(x)$ instead of just $x$. So the derivative of $h$ with respect to this new input is written as $\frac{dh}{dg}$ and is computed to be $2g(x)$. We just replaced the $x$ in $2x$ with $g(x)$.

This is helpful, because it’s easy to know the derivative of $g(x)$ or $h(x)$ individually. But to compute these derivatives for $h(g(x))$ would be much more involved. The chain rule helps us take a shortcut by using the atomic derivatives instead.
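We can sanity-check the rule on a small example. The sketch below keeps $h(x) = x^2$ from above and assumes a made-up inner function $g(x) = 3x + 1$, then compares the chain rule product against a direct numerical estimate of the slope of $h(g(x))$.

```python
def g(x):
    return 3 * x + 1    # made-up inner function, dg/dx = 3

def h(u):
    return u ** 2       # outer function, dh/du = 2u

def chain_rule_derivative(x):
    dh_dg = 2 * g(x)    # the "2x" rule, with x replaced by g(x)
    dg_dx = 3
    return dh_dg * dg_dx

def numerical_derivative(x, eps=1e-6):
    """Slope of h(g(x)) estimated by nudging x slightly in both directions."""
    return (h(g(x + eps)) - h(g(x - eps))) / (2 * eps)

print(chain_rule_derivative(2))   # 2 * (3*2 + 1) * 3 = 42
print(numerical_derivative(2))    # should agree up to rounding error
```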

Let’s apply the Chain Rule on our Loss Function.

$$\color{grey} \frac{\partial}{\partial w} \color{blue} \frac{1}{2} (\color{orange} y_{w,b}(\color{grey}x_i\color{orange}) \color{grey} - y_i \color{blue})^2 \color{black} = \frac{\color{blue}\partial L_2}{\color{orange}\partial y_{w,b}(x_i)} \frac {\color{orange}\partial y_{w,b}(x_i)}{\partial w}$$

Note that we are naming $L_2$ to be the outer function $\color{blue}\frac{1}{2}(\color{black}x\color{blue})^2$. We can ignore $y_i$ here, because it is a constant which gets dropped during differentiation.

So we have to find $\frac{\color{blue}\partial L_2}{\color{orange}\partial y_{w,b}(x_i)}$ and $\frac {\color{orange}\partial y_{w,b}(x_i)}{\partial w}$. Let’s start with the latter.

Recall the equation of the line $y_{w,b}(x)$:

$$y_{w,b}(x) = wx + b$$

When we’re computing the partial derivative with respect to $w$, it remains a variable and everything else ($x$ and $b$ in this case) is treated as a constant. Therefore, we get:

$$\frac{\partial y_{w,b}(x_i)}{\partial w} = x_i$$

Now let’s work with the other part, $\frac{\color{blue}\partial L_2}{\color{orange}\partial y_{w,b}(x_i)}$. Our outer function is $\frac{1}{2}x^2$. Its derivative is $\frac{d}{dx} \frac{1}{2}x^2 = \frac{1}{2} \cdot 2x$, which simplifies to just $x$, the input. In our case that looks like this:

$$\frac{\partial L_2}{\partial y_{w,b}(x_i)} = y_{w, b}(x_i) - y_i$$

We’ve computed both parts from the chain rule, so let’s multiply them together (still for a single data point $i$):

$$\frac{\partial L_2}{\partial w} = (y_{w, b}(x_i) - y_i)x_i$$

And we have the partial derivative for $w$! Now we do the same for $b$.

$$\frac{\partial L_2}{\partial b} = \frac{\partial L_2}{\partial y_{w,b}(x_i)} \frac{\partial y_{w,b}(x_i)}{\partial b}$$

We previously computed the first part to be $y_{w, b}(x_i) - y_i$. Let’s look at the second:

$$\frac{\partial y_{w,b}(x_i)}{\partial b} = 1$$

Now we multiply the two together:

$$\frac{\partial L_2}{\partial b} = (y_{w, b}(x_i) - y_i) \times 1 = y_{w, b}(x_i) - y_i$$

We now have the partial derivatives for our two parameters! Let’s package them into the Gradient Vector.

$$\nabla L_2(w,b) = \begin{bmatrix} \frac{\partial L_2}{\partial w}\\ \frac{\partial L_2}{\partial b}\\ \end{bmatrix} = \begin{bmatrix} (y_{w, b}(x_i) - y_i)x_i\\ y_{w, b}(x_i) - y_i\\ \end{bmatrix}$$

Let’s try to visualize the Gradient Vector on the Loss Landscape. Remember that we had moved the summation operator aside, so we will need to add that back.
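Written out with the summation restored, the full Gradient Vector looks like this. The second line is a worked value for our made-up data at the starting point $(w,b) = (1,0)$, where the prediction for each point is simply $x_i$.

```latex
\nabla L_2(w,b) =
\begin{bmatrix}
  \sum_{i=1}^{n} (y_{w,b}(x_i) - y_i)\,x_i \\
  \sum_{i=1}^{n} (y_{w,b}(x_i) - y_i)
\end{bmatrix}

\nabla L_2(1,0) =
\begin{bmatrix}
  \sum_{i=1}^{8} (x_i - y_i)\,x_i \\
  \sum_{i=1}^{8} (x_i - y_i)
\end{bmatrix}
=
\begin{bmatrix}
  204 - 129.17 \\
  36 - 23.16
\end{bmatrix}
=
\begin{bmatrix}
  74.83 \\
  12.84
\end{bmatrix}
```

Here $204 = \sum x_i^2$, $129.17 = \sum x_i y_i$, $36 = \sum x_i$ and $23.16 = \sum y_i$ for the eight data points.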

Visualizing the Gradient Vector

data = [(1, 0.85), (2, 1.43), (3, 1.92), (4, 2.59), (5, 3.20), (6, 3.82), (7, 4.38), (8, 4.97)]

w = 1
b = 0

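def loss(w, b):
  # Half the sum of squared errors over all data points (the L2 loss)
  total = 0
  for xi, yi in data:
    total += (y(xi, w, b) - yi) ** 2
  return total / 2

def gradient():
  # Sum the per-point partial derivatives to get the full gradient
  dw = db = 0
  for xi, yi in data:
    dw += (y(xi, w, b) - yi) * xi
    db += y(xi, w, b) - yi
  return dw, db

def y(x, w, b):
  return w * x + b

[Interactive 3D plot of the Loss Landscape showing gradient() and loss(w, b) at the current point, with sliders for b (from -15 to 15) and w (from -2.5 to 2.5).]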

We see that the Gradient Vector is incredibly large. This is because of the steepness of the Loss Landscape. You can try moving to the shallower regions and see the vector change size and direction.

What is useful for us is that it tells us the right direction to move in, as long as the point moves that way by only a very small amount.

Steps in the Right Direction

We will now try to use the information in the Gradient Vector to update our model’s parameters!

Since, as we just saw, this vector is gigantic, we will scale it down by a small factor. This factor is normally known as the learning rate or the step size.

Then we will update the parameters by subtracting the scaled gradient from them, which moves us downhill.

The choice of the learning rate is critical here. Too low, and the steps will be too small to reach our goal in a reasonable time. Too high, and our steps may overshoot our goal.
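To see both failure modes side by side, here is a small sketch that takes the same number of steps with three different learning rates and reports the loss afterwards. The helper `loss_after`, the three rates and the step count are assumptions chosen for illustration, and the functions take `w` and `b` as arguments rather than using globals.

```python
data = [(1, 0.85), (2, 1.43), (3, 1.92), (4, 2.59),
        (5, 3.20), (6, 3.82), (7, 4.38), (8, 4.97)]

def loss(w, b):
    return sum(((w * xi + b) - yi) ** 2 for xi, yi in data) / 2

def gradient(w, b):
    dw = db = 0.0
    for xi, yi in data:
        error = (w * xi + b) - yi
        dw += error * xi
        db += error
    return dw, db

def loss_after(learning_rate, steps=200, w=1.0, b=0.0):
    """Take a fixed number of steps from (w, b) and return the final loss."""
    for _ in range(steps):
        dw, db = gradient(w, b)
        w -= learning_rate * dw
        b -= learning_rate * db
    return loss(w, b)

print("start:", loss(1.0, 0.0))
for lr in (0.01, 0.001, 0.00001):
    print(lr, loss_after(lr))
```

On this data, the largest rate should leave the loss far higher than where it started, the smallest should barely make a dent in it, and the middle one should get much closer to the bottom of the valley.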

In the next code block, I’ve added a learning_rate variable. But I think I set it too high!

Can you try to find a good learning rate for our data?

Updating our parameters in steps

data = [(1, 0.85), (2, 1.43), (3, 1.92), (4, 2.59), (5, 3.20), (6, 3.82), (7, 4.38), (8, 4.97)]

w = 1
b = 0

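def loss(w, b):
  # Half the sum of squared errors over all data points (the L2 loss)
  total = 0
  for xi, yi in data:
    total += (y(xi, w, b) - yi) ** 2
  return total / 2

def gradient():
  # Sum the per-point partial derivatives to get the full gradient
  dw = db = 0
  for xi, yi in data:
    dw += (y(xi, w, b) - yi) * xi
    db += y(xi, w, b) - yi
  return dw, db

def y(x, w, b):
  return w * x + b

learning_rate = 0.01

def update():
  # Take one step against the gradient, scaled by the learning rate
  dw, db = gradient()
  global w, b
  w -= dw * learning_rate
  b -= db * learning_rate

[Interactive output showing the current values of w, b, gradient() and loss(w, b) after each step.]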

You know you’ve found a good learning rate when the model starts to find stability in its movement after quickly getting closer to matching the data.

Great! We are able to get our model to fit closer to the data, a step at a time! Now we just need to have it make these steps automatically!

The Gradient Descent Algorithm

The new train() function will be called by the controls below. Normally, models are trained for a fixed number of iterations, but in this case the parameter iterations is instead being used to control the speed of the animation.

To finish it off, I’d like you to call the update() function inside its for loop. :)

The Final Result

data = [(1, 0.85), (2, 1.43), (3, 1.92), (4, 2.59), (5, 3.20), (6, 3.82), (7, 4.38), (8, 4.97)]

w = 1
b = 0

learning_rate = 0.001

def y(x, w, b):
  return w * x + b

def gradient():
  dw = db = 0
  for xi, yi in data:
    dw += (y(xi, w, b) - yi) * xi  
    db += y(xi, w, b) - yi
  return dw, db

def update():
  dw, db = gradient()
  global w,b
  w -= dw * learning_rate
  b -= db * learning_rate
  return w, b

iteration_counter = 0
def train(iterations = 100):
  global iteration_counter
  for i in range(iterations):
    # call update() in this loop!
    
    iteration_counter += 1

[Interactive controls: simulation speed (1x), with live readouts of iteration_counter, w and b.]

We’ve done it!
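For reference, here is a compact, standalone sketch of the whole algorithm without the interactive controls. It passes `w` and `b` around as arguments instead of using globals, and the iteration count of 10000 is an assumption picked so that the line has time to settle with a learning rate of 0.001.

```python
data = [(1, 0.85), (2, 1.43), (3, 1.92), (4, 2.59),
        (5, 3.20), (6, 3.82), (7, 4.38), (8, 4.97)]

def gradient(w, b):
    dw = db = 0.0
    for xi, yi in data:
        error = (w * xi + b) - yi
        dw += error * xi
        db += error
    return dw, db

def train(w=1.0, b=0.0, learning_rate=0.001, iterations=10000):
    """Repeatedly step against the gradient and return the final parameters."""
    for _ in range(iterations):
        dw, db = gradient(w, b)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

w, b = train()
print("w =", round(w, 3), "b =", round(b, 3))
print("gradient at the end:", gradient(w, b))
```

The gradient printed at the end should be very close to zero in both components, which is how we know the point has reached the bottom of the valley.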

It’s just the beginning…

We’ve finally seen how machine learning models can be trained, and we’ve learnt this in detail by focusing on a simple model.

There are many paths from here. We can extend our code to multiple variables, or add an activation function, or even do both to create neural networks. We can also try different kinds of optimization techniques.

Perhaps you can train a model for some data that interests you!

Whatever that may be, I hope that you keep learning more about this fascinating subject!

Thank you for reading!


This lesson is an entry for the Summer of Math Exposition 4. Every year, it brings together a variety of math content to love! Go check them out!

While you’re at it, I want to mention Jumplion’s entry where he uses math to find The Best Phonetic Alphabet. I helped him out in a step of the audio analysis part.