How Neural Networks Work: Linear Regression and Gradient Descent Step by Step


Disclaimer: This post has been translated into English using a machine translation model. Please let me know if you find any mistakes.

As we have said, a neuron is a processing unit: it receives some signals, performs some calculations, and produces another signal.

artificial neural network

So let's look at the simplest example, the case in which a signal is received and another is output, and we'll see it using linear regression.

neural network regression

Let's suppose we have made some measurements and obtained the following points

	
```python
import numpy as np

x = np.array([ 0.        ,  0.34482759,  0.68965517,  1.03448276,  1.37931034,
               1.72413793,  2.06896552,  2.4137931 ,  2.75862069,  3.10344828,
               3.44827586,  3.79310345,  4.13793103,  4.48275862,  4.82758621,
               5.17241379,  5.51724138,  5.86206897,  6.20689655,  6.55172414,
               6.89655172,  7.24137931,  7.5862069 ,  7.93103448,  8.27586207,
               8.62068966,  8.96551724,  9.31034483,  9.65517241, 10.        ])
z = np.array([-0.16281253,  1.88707606,  0.39649312,  0.03857752,  4.0148778 ,
               0.58866234,  3.35711859,  1.94314906,  6.96106424,  5.89792585,
               8.47226615,  3.67698542, 12.05958678,  9.85234481,  9.82181679,
               6.07652248, 14.17536744, 12.67825433, 12.97499286, 11.76098542,
              12.7843083 , 16.42241036, 13.67913705, 15.55066478, 17.45979602,
              16.41982806, 17.01977617, 20.28151197, 19.38148414, 19.41029831])
```

```python
import matplotlib.pyplot as plt

plt.scatter(x, z)
plt.xlabel('X')
plt.ylabel('Z ', rotation=0)
plt.show()
```

(Figure: scatter plot of the measured points (x, z))

As we can see, this can be likened to a linear regression. That is, we can assume that the neuron receives x, multiplies it by a number, and produces z.

neural network regression

From here on, we are going to show how neural networks work, but starting with a simple example of just one neuron. Then we’ll progressively present more complex examples until we explain the general functioning of neural networks. But if you understand what is going to happen next, you will understand neural networks.

Our neuron has the parameter a, which is the one we want to change so that the line it generates resembles the points as closely as possible. The learning process of our neuron will consist of determining the best possible value of a through a series of calculations.
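To make this concrete, here is a minimal sketch of what this one-neuron model computes (the function name `neuron` is just an illustrative choice, it does not appear later in the post):

```python
# A one-parameter "neuron": it receives an input x, multiplies it by its
# parameter a, and outputs the result. Learning will mean finding a good a.
def neuron(x, a):
    return a * x

print(neuron(3.0, 2.0))  # with a = 2.0, an input of 3.0 produces 6.0
```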

Random Initialization of the Parameter

This example is simple, but when we have complex neural networks and do not know what values their parameters should take, the usual approach is to initialize them randomly.

	
```python
import random

random.seed(45)  # This is a seed: when random numbers are generated but we want
                 # the same one to come out every time, a number called a seed is
                 # usually fixed. This way a always takes the same value.
a = random.random()
a
```

```
0.2718754143840908
```

The value of a is 0.271875. Let's see what line we would get if we stopped now.

	
```python
z_p = a*x

plt.scatter(x, z)
plt.plot(x, z_p, 'k')
plt.xlabel('X')
plt.ylabel('Z ', rotation=0)
plt.show()
```

(Figure: the data points and the line z_p = a·x produced with the random initial a)

As we can see, the line doesn't resemble the points at all, so we will have to make our neuron *learn*.

Calculation of error or loss

To find the best possible value for a, we want to find a value that makes the outputs predicted by our neuron have the smallest possible error compared to the actual values of z.

In this type of problem, the mean squared error (MSE) is usually used. There are many other error functions, but for now they are not relevant, so let's stick with this one; we will learn more of them later on.

In the literature, this error is commonly referred to as the loss function, so from now on we will call it that.

The mean squared error (MSE) measures the distance between the points predicted by our neuron and the actual values of z, hence the word *error* in its name.

$(z_p - z)$

mean squared error

However, that distance will sometimes be positive and sometimes negative, depending on whether the value predicted by our neuron is above or below the actual value of z, so the distance is squared, hence the word *squared* in the name.

$(z_p - z)^2$

Lastly, all the squared distances are added together and divided by the number of samples—in other words, calculating an average as usual, hence the word *mean* in the name.

$$loss = \frac{\sum_{i=1}^{N} \left(z_p - z\right)^2}{N}$$

We already have a way to calculate the MSE (mean squared error).
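Before writing it in code, a tiny worked example (with made-up numbers, not our measurements) may help to see what the formula does:

```python
import numpy as np

# Made-up example: three real values and three predictions
z_real = np.array([1.0, 2.0, 3.0])
z_pred = np.array([1.5, 1.0, 3.0])

# Differences: [0.5, -1.0, 0.0] -> squared: [0.25, 1.0, 0.0]
# Mean of the squared differences: (0.25 + 1.0 + 0.0) / 3 = 0.4166...
mse = np.mean((z_pred - z_real) ** 2)
print(mse)  # 0.4166666666666667
```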

In our case, our loss is

	
```python
def loss(z, z_p):
    n = len(z)
    loss = np.sum((z_p - z) ** 2) / n
    return loss
```

```python
error = loss(z, z_p)
error
```

```
103.72263739946467
```

Although this does not tell us much, we must remember that we are looking for the minimum of the error function, so we should look for a value close to 0.

Let's see how the loss function changes depending on the value of `a`.

	
```python
posibles_a = np.linspace(0, 4, 30)
perdidas = np.empty_like(posibles_a)
for i in range(30):
    z_p = posibles_a[i] * x
    perdidas[i] = loss(z, z_p)

plt.plot(posibles_a, perdidas)
plt.xlabel('a')
plt.ylabel('loss ', rotation=0)
plt.show()
```

(Figure: the loss as a function of a)

We can see that the error or loss is smallest when a is around 2. You might think: that's it, problem solved. And that's true, but as I said, we were starting with the simplest possible problem, one that can be solved just by looking at a graph.

If the problem had 2 parameters, we could look at a 3-dimensional graph to search for the minimum.

3-dimensional plot
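For instance, with two parameters a and b (a line z = a·x + b) we could still evaluate the loss on a grid of candidate values and look at the resulting surface. This is just an illustrative sketch (the names `grid_a`, `grid_b` and `loss_surface` are made up here); it assumes the x, z and loss defined above:

```python
# Loss surface for a two-parameter model z_p = a*x + b, evaluated on a grid
# (assumes x, z and the loss() function defined above)
grid_a = np.linspace(0, 4, 50)
grid_b = np.linspace(-2, 2, 50)
loss_surface = np.empty((len(grid_a), len(grid_b)))
for i, a_i in enumerate(grid_a):
    for j, b_j in enumerate(grid_b):
        loss_surface[i, j] = loss(z, a_i * x + b_j)

# The smallest value on the grid gives an approximate best (a, b)
i_min, j_min = np.unravel_index(np.argmin(loss_surface), loss_surface.shape)
print(grid_a[i_min], grid_b[j_min])
```

With two parameters this brute-force search is still feasible, but the number of grid points grows exponentially with the number of parameters, which is exactly why we need something better.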

But as soon as our problem has more than 2 parameters, we can no longer look for the minimum error using a graph. Not to mention that neural networks have millions of parameters; for example, the resnet18 neural network (which we will study later) is a small network and has around 11 million parameters. It is impossible to look for the minimum error there manually. So, we need an automatic method using calculations.
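If you are curious about that figure and have torchvision installed (an assumption, it is not used anywhere else in this post), you can count the parameters of resnet18 yourself:

```python
# Count the parameters of resnet18 (assumes the torchvision package is installed)
from torchvision.models import resnet18

model = resnet18()
num_parametros = sum(p.numel() for p in model.parameters())
print(f"{num_parametros:,}")  # roughly 11.7 million parameters
```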

Gradient Descent

As we have said, we need to find the value of a that makes the loss function minimal and, at the same time, do so using an algorithm.

One of the peculiarities of a minimum of a function is that its gradient or derivative is 0.

In case you don't know what a derivative or gradient is: the derivative of a function at a point is the slope of the line tangent to the function at that point.

derivative

For example, in this image, the derivatives at A, B, and C are the green, blue, and black lines, respectively.

The derivative measures the slope of a function: the steeper the function is at a point, the closer to vertical the tangent line will be at that point, and the flatter the function is at a point, the closer to horizontal (parallel to the *x-axis*) the tangent line will be.

How is the gradient of the loss function with respect to a calculated? We previously defined the loss function as:

$$loss = \frac{\sum_{i=1}^{N} \left(z_p - z\right)^2}{N}$$

Well, if we differentiate it with respect to a we get

$$\frac{\partial loss}{\partial a} = \frac{\partial \left( \frac{\sum_{i=1}^{N} \left(z_p - z\right)^2}{N} \right)}{\partial a} = \frac{\partial \left( \frac{\sum_{i=1}^{N} \left(a x - z\right)^2}{N} \right)}{\partial a} = \frac{2}{N} \sum_{i=1}^{N} \left(a x - z\right) x = \frac{2}{N} \sum_{i=1}^{N} \left(z_p - z\right) x$$

If we look again at the graph of the loss function with respect to the value of a, the steeper the function, that is, the larger the magnitude of the derivative, the further we are from the minimum; and the smaller the derivative, the gentler the slope and the closer we are to the minimum.

	
```python
def gradiente(a, x, z):
    # Function that computes the value of the derivative at a point
    n = len(z)
    return 2 * np.sum((a*x - z) * x) / n
```
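As a sanity check (this is not part of the original derivation), we can compare the analytic gradient with a numerical approximation based on centred finite differences; both values should be almost identical:

```python
# Numerical check of the analytic gradient using centred finite differences
def gradiente_numerico(a, x, z, h=1e-6):
    return (loss(z, (a + h) * x) - loss(z, (a - h) * x)) / (2 * h)

a_prueba = 0.5  # any test value of a works here
print(gradiente(a_prueba, x, z))           # analytic gradient
print(gradiente_numerico(a_prueba, x, z))  # numerical approximation
```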
	
```python
def gradiente_linea(i, a=None, error=None, gradiente=None):
    # Function that returns the points of the line corresponding to the
    # derivative of a function at a given point
    if a is None:
        x1 = posibles_a[i] - 0.7
        x2 = posibles_a[i]
        x3 = posibles_a[i] + 0.7
        b = perdidas[i] - gradientes[i] * posibles_a[i]
        z1 = gradientes[i] * x1 + b
        z2 = perdidas[i]
        z3 = gradientes[i] * x3 + b
    else:
        x1 = a - 0.7
        x2 = a
        x3 = a + 0.7
        b = error - gradiente * a
        z1 = gradiente * x1 + b
        z2 = error
        z3 = gradiente * x3 + b
    x_linea = np.array([x1, x2, x3])
    z_linea = np.array([z1, z2, z3])
    return x_linea, z_linea
```
	
```python
posibles_a = np.linspace(0, 4, 30)
perdidas = np.empty_like(posibles_a)
gradientes = np.empty_like(posibles_a)
for i in range(30):
    z_p = posibles_a[i] * x
    perdidas[i] = loss(z, z_p)
    gradientes[i] = gradiente(posibles_a[i], x, z)  # These are the values of the derivative at each value of a,
                                                    # i.e. the slope of the line tangent to the curve

# Compute the gradient line at the start
i_inicio = 3
x_inicio, z_inicio = gradiente_linea(i_inicio)

# Compute the gradient line at the bottom
i_base = 14
x_base, z_base = gradiente_linea(i_base)

# Compute the gradient line at the end
i_final = -3
x_final, z_final = gradiente_linea(i_final)

# Plot the error as a function of a
plt.plot(posibles_a, perdidas, linewidth=3)

# Plot the derivative at the start
plt.plot(x_inicio, z_inicio, 'g')
plt.scatter(posibles_a[i_inicio], perdidas[i_inicio], c='green')

# Plot the derivative in the middle
plt.plot(x_base, z_base, 'y')
plt.scatter(posibles_a[i_base], perdidas[i_base], c='pink')

# Plot the derivative at the end
plt.plot(x_final, z_final, 'r')
plt.scatter(posibles_a[i_final], perdidas[i_final], c='red')

plt.xlabel('a')
plt.ylabel('loss ', rotation=0)
plt.show()
```

(Figure: the loss curve with the tangent lines at the start, at the bottom, and at the end)

As can be seen, at the beginning of the graph, we are far from the minimum, so the derivative of the function (green line) is very steep, just like at the end of the function (red line). However, when we are near the minimum, the derivative is small (yellow line).

As a reminder: what we want is to modify the value of a so that the cost function is minimized, which means the error over all the pairs (x, z) will be as small as possible. We already have a way to know how far we are from that minimum; now we need a way to modify a so that it moves towards the minimum-cost region.

The way to do this is through **gradient descent**. What we are going to do is modify the value of a based on the value of the gradient.

$$a' = a - \alpha \frac{\partial loss}{\partial a}$$

As you can see, the derivative of the loss function multiplied by α, which is known as the **learning rate**, is subtracted from a. Let's look at this step by step.

First, the derivative of the loss function is subtracted from a; let's see why. Suppose we are at the first point of the cost function (the one with the green line). As we have seen, its derivative has a steep slope, and it is also negative (since the function goes down as we move from left to right), so by subtracting a negative value from a we are actually adding to it; that is, we make a larger, which brings it closer to the region of the minimum.

Now the other way around: suppose we are at the last point (the one with the red line). At that point the derivative also has a steep slope, but it is positive (since the function goes up as we move from left to right). Therefore, at that point we subtract a positive number from a; that is, we make a smaller, again bringing it closer to the region of the minimum.

Let's now see what the **learning rate** α means.

This is a learning factor that we choose, that is, we are setting how quickly a will move. In other words, we are configuring the learning rate of the neural network. The larger α is, the faster the network will learn, while the smaller it is, the slower it will learn.

Later on, we will study what happens when the value of α is changed, but for now just keep in mind that typical values of α are between $10^{-3}$ and $10^{-4}$.
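Before putting everything into a loop, here is a minimal sketch of a single update step with a typical learning rate (the name a_nuevo is just illustrative; a itself is left untouched so the training below starts from the same random value):

```python
# One gradient descent step (uses the a, x, z, loss() and gradiente() defined above)
lr = 1e-3                      # learning rate α
dl = gradiente(a, x, z)        # ∂loss/∂a at the current value of a
a_nuevo = a - lr * dl          # update rule a' = a - α·∂loss/∂a
print(a, "->", a_nuevo)                            # a moves towards the minimum
print(loss(z, a * x), "->", loss(z, a_nuevo * x))  # and the loss decreases
```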

Training loop

We already have a way to determine the error introduced by the chosen value of a, and a formula to modify the value of a. Now all that's left is to repeat this loop several times until we reach the minimum of the cost function.

	
```python
lr = 10**-3  # Learning rate
steps = 100  # Number of times the training loop is executed

# Arrays where the data is stored so we can later plot the evolution of the training
Zs = np.empty([steps, len(x)])
Xs_linea_gradiente = np.empty([steps, len(x_inicio)])
Zs_linea_gradiente = np.empty([steps, len(z_inicio)])
As = np.empty(steps)
Errores = np.empty(steps)

for i in range(steps):
    # Compute the gradient
    dl = gradiente(a, x, z)
    # Correct the value of a
    a = a - lr*dl
    # Compute the values produced by the neural network
    z_p = a*x
    # Compute the error
    error = loss(z_p, z)
    # Get the gradient lines so we can plot them
    x_linea_gradiente, z_linea_gradiente = gradiente_linea(i_inicio, a=a, error=error, gradiente=dl)
    # Store the values so we can later plot the evolution of the training
    As[i] = a
    Zs[i,:] = z_p
    Errores[i] = error
    Xs_linea_gradiente[i,:] = x_linea_gradiente
    Zs_linea_gradiente[i,:] = z_linea_gradiente
    # Print the evolution of the training
    if (i+1) % 10 == 0:
        print(f"i={i+1}: error={error}, gradiente={dl}, a={a}")
```

```
i=10: error=28.075728775043547, gradiente=-61.98100236918255, a=1.1394358551489718
i=20: error=9.50524503591466, gradiente=-30.709631939506394, a=1.569284692740735
i=30: error=4.946395605365449, gradiente=-15.215654116766258, a=1.7822612353503635
i=40: error=3.8272482302958437, gradiente=-7.5388767490642445, a=1.887784395436304
i=50: error=3.552509863476323, gradiente=-3.7352756707945205, a=1.9400677933076504
i=60: error=3.4850646173147437, gradiente=-1.8507112931062708, a=1.9659725680580205
i=70: error=3.4685075514689503, gradiente=-0.9169690786711168, a=1.978807566971417
i=80: error=3.464442973977972, gradiente=-0.4543292594425819, a=1.9851669041474533
i=90: error=3.4634451649256888, gradiente=-0.22510581958202128, a=1.9883177551597053
i=100: error=3.463200213782052, gradiente=-0.111532834297016, a=1.989878902465877
```

Let's represent the evolution of the training in a graph

	
```python
# Create a GIF with the evolution of the training
from matplotlib.animation import FuncAnimation
from IPython.display import display, Image

# Create the initial figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_tight_layout(True)
ax1.set_xlabel('X')
ax1.set_ylabel('Z ', rotation=0)
ax2.set_xlabel('a')
ax2.set_ylabel('loss ', rotation=0)

# Plot the data that persists throughout the whole evolution of the figure
ax1.scatter(x, z)
ax2.plot(posibles_a, perdidas, linewidth=3)

# Plot the rest of the lines, which will change during training
line1, = ax1.plot(x, Zs[0,:], 'k', linewidth=2)  # Line generated with the learned slope a
line2, = ax2.plot(Xs_linea_gradiente[0,:], Zs_linea_gradiente[0,:], 'g')  # Gradient of the error function
punto2, = ax2.plot(As[0], Errores[0], 'r*')  # Point where the gradient is computed

# Draw text inside the second subplot
fontsize = 12
a_text = ax2.text(1, 150, f'a = {As[0]:.2f}', fontsize=fontsize)
error_text = ax2.text(1, 125, f'loss = {Errores[0]:.2f}', fontsize=fontsize)

# Draw a title
titulo = fig.suptitle(f'step: {0}', fontsize=fontsize)

# Define the function that updates the figure with the evolution of the training
def update(i):
    # Update line 1: line generated with the learned slope a
    line1.set_ydata(Zs[i,:])
    # Update line 2: gradient of the error function
    line2.set_xdata(Xs_linea_gradiente[i,:])
    line2.set_ydata(Zs_linea_gradiente[i,:])
    # Update point 2: point where the gradient is computed
    punto2.set_xdata([As[i]])
    punto2.set_ydata([Errores[i]])
    # Update the texts
    a_text.set_text(f'a = {As[i]:.2f}')
    error_text.set_text(f'loss = {Errores[i]:.2f}')
    titulo.set_text(f'step: {i}')
    return line1, ax1, line2, punto2, ax2, a_text, error_text

# Create the animation with a refresh every 200 ms
interval = 200  # ms
anim = FuncAnimation(fig, update, frames=np.arange(0, steps), interval=interval)

# Save the animation as a gif
gif_name = "GIFs/entrenamiento_regresion.gif"
anim.save(gif_name, dpi=80, writer='pillow')

# Read the GIF and display it
with open(gif_name, 'rb') as f:
    display(Image(data=f.read()))

# Close the figure so it is not shown in the notebook
plt.close()
```

(GIF: animation of the training, with the fitted line on the left and the descent along the loss curve on the right)

We are now going to walk through the process again, without dwelling so much on each detail, to reinforce the concepts.

We have the following values of x and z

	
```python
x = np.array([ 0.        ,  0.34482759,  0.68965517,  1.03448276,  1.37931034,
               1.72413793,  2.06896552,  2.4137931 ,  2.75862069,  3.10344828,
               3.44827586,  3.79310345,  4.13793103,  4.48275862,  4.82758621,
               5.17241379,  5.51724138,  5.86206897,  6.20689655,  6.55172414,
               6.89655172,  7.24137931,  7.5862069 ,  7.93103448,  8.27586207,
               8.62068966,  8.96551724,  9.31034483,  9.65517241, 10.        ])
z = np.array([-0.16281253,  1.88707606,  0.39649312,  0.03857752,  4.0148778 ,
               0.58866234,  3.35711859,  1.94314906,  6.96106424,  5.89792585,
               8.47226615,  3.67698542, 12.05958678,  9.85234481,  9.82181679,
               6.07652248, 14.17536744, 12.67825433, 12.97499286, 11.76098542,
              12.7843083 , 16.42241036, 13.67913705, 15.55066478, 17.45979602,
              16.41982806, 17.01977617, 20.28151197, 19.38148414, 19.41029831])

plt.scatter(x, z)
plt.xlabel('X')
plt.ylabel('Z ', rotation=0)
plt.show()
```

(Figure: scatter plot of the points (x, z))

So we use a single neuron to try to find the line that best fits those points.

neural network regression

We just need to find the best possible value of a

We start by initializing a with a random value

	
```python
random.seed(45)
a = random.random()
a
```

```
0.2718754143840908
```

We calculate the z values generated by the neuron with the value of a that we just initialized.

	
```python
z_p = a*x

plt.scatter(x, z)
plt.plot(x, z_p, 'k')
plt.xlabel('X')
plt.ylabel('Z ', rotation=0)
plt.show()
```

(Figure: the points and the line generated with the initial value of a)

But we see that, with the value of a we have set, the line produced by the neuron does not fit the points.

We need to know how good or bad the value of a is, so we calculate the error of the neuron's output with respect to the data we have. For this, we use the mean squared error (MSE) using the formula

$$loss = \frac{\sum_{i=1}^{N} \left(z_p - z\right)^2}{N}$$

Right now, our error is

	
```python
error = loss(z, z_p)
error
```

```
103.72263739946467
```

Since we already have a way to measure the error, we want to decrease it, so we look for a value of a at which the gradient of the error with respect to a is zero, or as close to zero as possible. To achieve this, we run a training loop in which we modify the value of a using the formula

$$a' = a - \alpha \frac{\partial loss}{\partial a}$$

Where α is called the learning rate and determines the speed of learning.

	
```python
lr = 10**-3  # Learning rate
steps = 100  # Number of times the training loop is executed

# Arrays where the data is stored so we can later plot the evolution of the training
Zs = np.empty([steps, len(x)])
Xs_linea_gradiente = np.empty([steps, len(x_inicio)])
Zs_linea_gradiente = np.empty([steps, len(z_inicio)])
As = np.empty(steps)
Errores = np.empty(steps)

for i in range(steps):
    # Compute the gradient
    dl = gradiente(a, x, z)
    # Correct the value of a
    a = a - lr*dl
    # Compute the values produced by the neural network
    z_p = a*x
    # Compute the error
    error = loss(z, z_p)
    # Get the gradient lines so we can plot them
    x_linea_gradiente, z_linea_gradiente = gradiente_linea(i_inicio, a=a, error=error, gradiente=dl)
    # Store the values so we can later plot the evolution of the training
    As[i] = a
    Zs[i,:] = z_p
    Errores[i] = error
    Xs_linea_gradiente[i,:] = x_linea_gradiente
    Zs_linea_gradiente[i,:] = z_linea_gradiente
    # Print the evolution of the training
    if (i+1) % 10 == 0:
        print(f"i={i+1}: error={error}, gradiente={dl}, a={a}")
```

```
i=10: error=28.075728775043547, gradiente=-61.98100236918255, a=1.1394358551489718
i=20: error=9.50524503591466, gradiente=-30.709631939506394, a=1.569284692740735
i=30: error=4.946395605365449, gradiente=-15.215654116766258, a=1.7822612353503635
i=40: error=3.8272482302958437, gradiente=-7.5388767490642445, a=1.887784395436304
i=50: error=3.552509863476323, gradiente=-3.7352756707945205, a=1.9400677933076504
i=60: error=3.4850646173147437, gradiente=-1.8507112931062708, a=1.9659725680580205
i=70: error=3.4685075514689503, gradiente=-0.9169690786711168, a=1.978807566971417
i=80: error=3.464442973977972, gradiente=-0.4543292594425819, a=1.9851669041474533
i=90: error=3.4634451649256888, gradiente=-0.22510581958202128, a=1.9883177551597053
i=100: error=3.463200213782052, gradiente=-0.111532834297016, a=1.989878902465877
```

We can see that the error has decreased significantly, from 103.72 that we had initially to 3.46 that we have now.
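As a sanity check (this is not part of the training itself): for a line through the origin, z = a·x, the least-squares slope has a closed-form solution, and it agrees with the value of a that gradient descent converged to:

```python
# Closed-form least-squares slope for a line through the origin:
# a* = sum(x*z) / sum(x^2), the exact minimiser of the MSE for this model
a_optimo = np.sum(x * z) / np.sum(x ** 2)
print(a_optimo)               # approx. 1.99, very close to the learned a
print(loss(z, a_optimo * x))  # the minimum loss achievable with z_p = a*x
```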

We represent the evolution of the training to see it graphically

(GIF: animation of the evolution of the training)
