Deep Learning From Scratch - Theory and Implementation
Gradient descent
Generally, if we want to find the minimum of a function, we set the derivative to zero and solve for the parameters. It turns out, however, that it is impossible to obtain a closed-form solution for $W$ and $b$, so we resort to an iterative method instead: gradient descent.
As a visual analogy, imagine yourself standing on a mountain and trying to find the way down. At every step, you walk in the steepest downhill direction, since this direction is the most promising one to lead you towards the bottom.
If taking steep steps seems a little dangerous to you, imagine that you are a mountain goat (mountain goats are amazing rock climbers).
Gradient descent operates in a similar way when trying to find the minimum of a function: It starts at a random location in parameter space and then iteratively reduces the error $J$ until it reaches a local minimum.
As you might remember, the direction of steepest ascent of a function at a certain point is given by the gradient at that point. Therefore, the direction of steepest descent is given by the negative of the gradient. So now we have a rough idea of how to minimize $J$:

1. Start with random values for $W$ and $b$
2. Compute the gradients of $J$ with respect to $W$ and $b$
3. Take a small step along the direction of the negative gradient
4. Go back to step 2
Let's implement an operation that minimizes the value of a node using gradient descent. We require the user to specify the magnitude of the step along the gradient as a parameter called learning_rate.
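Here is a minimal sketch of what this operation could look like. It assumes the Operation and Variable classes from the earlier sections and the compute_gradient(loss) function that we derive in the next section:

```python
class GradientDescentOptimizer:
    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def minimize(self, loss):
        learning_rate = self.learning_rate

        class MinimizationOperation(Operation):
            def compute(self):
                # Compute the gradient of the loss with respect to every node
                grad_table = compute_gradient(loss)

                # Take a small step along the negative gradient
                # for each trainable variable
                for node in grad_table:
                    if type(node) == Variable:
                        node.value -= learning_rate * grad_table[node]

        return MinimizationOperation()
```

Wrapping the parameter update in an operation means that a single evaluation of the returned node performs one step of gradient descent.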
The following image depicts an example iteration of gradient descent. We start out with a random separating line (marked as 1), take a step, arrive at a slightly better line (marked as 2), take another step, and another step, and so on until we arrive at a good separating line.
Backpropagation
In our implementation of gradient descent, we have used a function compute_gradient(loss) that computes the gradient of a given loss node with respect to all the other nodes in our computational graph. Let's now derive how to compute these gradients; this algorithm is known as backpropagation.
Consider the following computational graph: a simple chain in which a node $a$ feeds into $b$, $b$ feeds into $c$, and $c$ feeds into the loss $e$.

By the chain rule, we have

$$\frac{\partial e}{\partial a} = \frac{\partial e}{\partial c} \cdot \frac{\partial c}{\partial b} \cdot \frac{\partial b}{\partial a}$$

As we can see, in order to compute the gradient of $e$ with respect to $a$, we can start at $e$ and move backwards towards $a$, multiplying the local gradient at every node along the way.

Now consider the following scenario: $a$ has two consumers $b$ and $c$, both of which feed into $e$.

In this case, there are two paths from $a$ to $e$, and the chain rule tells us to sum the gradients along both of them:

$$\frac{\partial e}{\partial a} = \frac{\partial e}{\partial b} \cdot \frac{\partial b}{\partial a} + \frac{\partial e}{\partial c} \cdot \frac{\partial c}{\partial a}$$
This gives us an intuition for the general algorithm that computes the gradient of the loss with respect to another node: We perform a backwards breadth-first search starting from the loss node. At each visited node $n$, we compute the gradient of the loss with respect to $n$'s output by iterating over $n$'s consumers. For each consumer $c$, we:

- retrieve the gradient $G$ of the loss with respect to the output of $c$
- multiply $G$ by the gradient of $c$'s output with respect to $n$'s output

We then sum these gradients over all of $n$'s consumers.
As a prerequisite to implementing backpropagation, we need to specify a function for each operation that computes the gradients with respect to the inputs of that operation, given the gradients with respect to the output. Let's define a decorator @RegisterGradient(operation_name) for this purpose:
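A sketch of one way to write this decorator; as an assumption of this sketch, the registry is a plain module-level dictionary keyed by the operation's name:

```python
# Maps an operation name to the function that computes its input gradients
_gradient_registry = {}

class RegisterGradient:
    """A decorator for registering the gradient function for an op type."""

    def __init__(self, op_name):
        # op_name: the operation's name, e.g. "add" or "matmul"
        self._op_name = op_name

    def __call__(self, f):
        # Register f as the gradient function for this operation
        _gradient_registry[self._op_name] = f
        return f
```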
Now assume that our _gradient_registry dictionary is already filled with gradient computation functions for all of our operations. We can now implement backpropagation:
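The following sketch implements the backwards breadth-first search described above. It assumes that every node provides the consumers and input_nodes attributes from the earlier sections, and that operation classes are named after their operations (add, matmul, and so on), matching the name-keyed registry sketched above:

```python
from queue import Queue

def compute_gradient(loss):
    # grad_table[node] holds the gradient of the loss w.r.t. node's output
    grad_table = {}

    # The gradient of the loss with respect to itself is 1
    grad_table[loss] = 1

    # Backwards breadth-first search, starting from the loss
    visited = set()
    queue = Queue()
    visited.add(loss)
    queue.put(loss)

    while not queue.empty():
        node = queue.get()

        if node != loss:
            grad_table[node] = 0

            # Sum the gradient contributions from all consumers of node
            for consumer in node.consumers:
                # Gradient of the loss w.r.t. the consumer's output
                lossgrad_wrt_consumer_output = grad_table[consumer]

                # Gradients of the loss w.r.t. the consumer's inputs
                bprop = _gradient_registry[consumer.__class__.__name__]
                lossgrads_wrt_consumer_inputs = bprop(
                    consumer, lossgrad_wrt_consumer_output)

                if len(consumer.input_nodes) == 1:
                    grad_table[node] += lossgrads_wrt_consumer_inputs
                else:
                    # Pick out the gradient w.r.t. this particular input
                    index = consumer.input_nodes.index(node)
                    grad_table[node] += lossgrads_wrt_consumer_inputs[index]

        # Continue the search with this node's inputs
        if hasattr(node, "input_nodes"):
            for input_node in node.input_nodes:
                if input_node not in visited:
                    visited.add(input_node)
                    queue.put(input_node)

    return grad_table
```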
Gradient of each operation
For each of our operations, we now need to define a function that turns a gradient of the loss with respect to the operation's output into a list of gradients of the loss with respect to each of the operation's inputs. Computing a gradient with respect to a matrix can be somewhat tedious. Therefore, the details have been omitted and I just present the results. You may skip this section and still understand the overall picture. The implementations for add, reduce_sum and softmax are a little more involved, and I'd recommend not spending too much time trying to understand them.
If you want to comprehend how to arrive at the results, the general approach is as follows:
- Find the partial derivative of each output value with respect to each input value. This can be a tensor of a rank greater than 2, i.e. neither scalar nor vector nor matrix, involving a lot of summations.
- Compute the gradient of the loss with respect to the node's inputs, given a gradient with respect to the node's output, by applying the chain rule. The result is a tensor of the same shape as the input tensor, so if the input is a matrix, the result is also a matrix.
- Rewrite this result as a sequence of matrix operations in order to compute it efficiently. This step can be somewhat tricky.
Gradient for negative
Given a gradient $G$ with respect to $-x$, the gradient with respect to $x$ is given by $-G$.
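A sketch of the corresponding registry entry; op and grad follow the conventions of the backpropagation sketch above, and grad is assumed to be a NumPy array:

```python
@RegisterGradient("negative")
def _negative_gradient(op, grad):
    # If y = -x, then dJ/dx = -dJ/dy
    return -grad
```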
Gradient for log
Given a gradient $G$ with respect to $\log(x)$, the gradient with respect to $x$ is given by $G / x$ (elementwise).
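A corresponding sketch, assuming the operation's input values are available as op.inputs after the forward pass:

```python
@RegisterGradient("log")
def _log_gradient(op, grad):
    # If y = log(x), then dJ/dx = dJ/dy / x (elementwise)
    x = op.inputs[0]
    return grad / x
```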
Gradient for sigmoid
Given a gradient $G$ with respect to $\sigma(a)$, the gradient with respect to $a$ is given by $G \cdot \sigma(a) \cdot (1 - \sigma(a))$.
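A sketch that reuses the cached forward value, assuming it is available as op.output:

```python
@RegisterGradient("sigmoid")
def _sigmoid_gradient(op, grad):
    # If s = sigmoid(a), then dJ/da = dJ/ds * s * (1 - s)
    sigmoid = op.output
    return grad * sigmoid * (1 - sigmoid)
```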
Gradient for multiply
Given a gradient $G$ with respect to $A \odot B$, the gradient with respect to $A$ is given by $G \odot B$, and the gradient with respect to $B$ is given by $G \odot A$ (where $\odot$ denotes elementwise multiplication).
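A corresponding sketch:

```python
@RegisterGradient("multiply")
def _multiply_gradient(op, grad):
    # If C = A * B elementwise, then dJ/dA = G * B and dJ/dB = G * A
    A, B = op.inputs
    return [grad * B, grad * A]
```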
Gradient for matmul
Given a gradient $G$ with respect to $AB$, the gradient with respect to $A$ is given by $GB^T$, and the gradient with respect to $B$ is given by $A^TG$.
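A corresponding sketch:

```python
@RegisterGradient("matmul")
def _matmul_gradient(op, grad):
    # If C = A @ B, then dJ/dA = G @ B.T and dJ/dB = A.T @ G
    A, B = op.inputs
    return [grad.dot(B.T), A.T.dot(grad)]
```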
Gradient for add
Given a gradient $G$ with respect to $a + b$, the gradient with respect to $a$ is given by $G$, and the gradient with respect to $b$ is also given by $G$, provided that $a$ and $b$ have the same shape. If one of them was broadcast along an axis, e.g. when adding a bias vector to every row of a matrix, the gradient has to be summed along that axis.
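A sketch that also handles broadcasting by summing the gradient along any broadcast axis:

```python
import numpy as np

@RegisterGradient("add")
def _add_gradient(op, grad):
    a, b = op.inputs
    grad_wrt_a = grad
    grad_wrt_b = grad

    # Sum the gradient over axes along which each input was broadcast,
    # so that every gradient matches the shape of its input
    while np.ndim(grad_wrt_a) > len(np.shape(a)):
        grad_wrt_a = np.sum(grad_wrt_a, axis=0)
    for axis, size in enumerate(np.shape(a)):
        if size == 1:
            grad_wrt_a = np.sum(grad_wrt_a, axis=axis, keepdims=True)

    while np.ndim(grad_wrt_b) > len(np.shape(b)):
        grad_wrt_b = np.sum(grad_wrt_b, axis=0)
    for axis, size in enumerate(np.shape(b)):
        if size == 1:
            grad_wrt_b = np.sum(grad_wrt_b, axis=axis, keepdims=True)

    return [grad_wrt_a, grad_wrt_b]
```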
Gradient for reduce_sum
Given a gradient $G$ with respect to the output of reduce_sum, the gradient with respect to the input $A$ is obtained by repeating $G$ along the axis that was summed over, so that it has the same shape as $A$.
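A corresponding sketch, assuming the operation stores the summation axis as op.axis:

```python
import numpy as np

@RegisterGradient("reduce_sum")
def _reduce_sum_gradient(op, grad):
    # Tile the gradient along the reduced axis so that it matches
    # the shape of the input
    A = op.inputs[0]
    output_shape = np.array(np.shape(A))
    output_shape[op.axis] = 1
    tile_scaling = np.shape(A) // output_shape
    grad = np.reshape(grad, output_shape)
    return np.tile(grad, tile_scaling)
```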
Gradient for softmax
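Given a gradient $G$ with respect to the softmax output $s$ (computed row-wise over the classes), the gradient with respect to the softmax's input is $s \odot (G - \text{rowsum}(G \odot s))$, where the row-wise sum runs over the classes. A corresponding sketch:

```python
import numpy as np

@RegisterGradient("softmax")
def _softmax_gradient(op, grad):
    # If s = softmax(a) row-wise, then dJ/da = s * (G - rowsum(G * s))
    softmax = op.output
    return (grad - np.reshape(np.sum(grad * softmax, 1), [-1, 1])) * softmax
```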
Example
Let's now test our implementation to determine the optimal weights for our perceptron.
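A sketch of what this test could look like. It assumes the Graph, placeholder, Variable and Session classes, the operations defined earlier, and the blue_points / red_points dataset from the perceptron section; the learning rate and step count are illustrative choices:

```python
import numpy as np

# Create a new graph
Graph().as_default()

X = placeholder()
c = placeholder()

# Initialize weights and biases randomly
W = Variable(np.random.randn(2, 2))
b = Variable(np.random.randn(2))

# Perceptron output probabilities
p = softmax(add(matmul(X, W), b))

# Cross-entropy loss
J = negative(reduce_sum(reduce_sum(multiply(c, log(p)), axis=1)))

# One evaluation of this node performs one gradient descent step
minimization_op = GradientDescentOptimizer(learning_rate=0.01).minimize(J)

feed_dict = {
    X: np.concatenate((blue_points, red_points)),
    c: [[1, 0]] * len(blue_points) + [[0, 1]] * len(red_points)
}

session = Session()
for step in range(100):
    J_value = session.run(J, feed_dict)
    if step % 10 == 0:
        print("Step:", step, " Loss:", J_value)
    session.run(minimization_op, feed_dict)

# Retrieve the trained parameters
W_value = session.run(W)
b_value = session.run(b)
```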
Notice that we started out with a rather high loss and incrementally reduced it. Let's plot the final line to check that it is a good separator:
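A plotting sketch using matplotlib; W_value and b_value come from the training sketch above, and the decision boundary is the line where both class scores are equal:

```python
import matplotlib.pyplot as plt
import numpy as np

# Plot the two classes of points
plt.scatter(red_points[:, 0], red_points[:, 1], color='red')
plt.scatter(blue_points[:, 0], blue_points[:, 1], color='blue')

# The boundary satisfies (W00 - W01) x + (W10 - W11) y + (b0 - b1) = 0
x_axis = np.linspace(-4, 4, 100)
y_axis = -(x_axis * (W_value[0][0] - W_value[0][1])
           + b_value[0] - b_value[1]) / (W_value[1][0] - W_value[1][1])
plt.plot(x_axis, y_axis)
plt.show()
```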