Deep Learning From Scratch - Theory and Implementation
Training criterion
Great, so now we are able to classify points using a linear classifier and compute the probability that a point belongs to a certain class, provided that we know the appropriate parameters for the weight matrix and bias. The question that remains is how to find good values for these parameters.
The misclassification rate
Ideally, we want to find a line that makes as few errors as possible. For every point $x$ with true class $c$, we want the class predicted by our classifier to equal $c$; the fraction of points for which this fails is the misclassification rate.
Generally, we do not know the data-generating distribution, so we cannot compute the misclassification rate over all possible points.
We could use the training data to find a classifier that minimizes the misclassification rate on the training samples:
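In symbols (the notation is chosen here for illustration: $N$ is the number of training samples, $\hat{c}^{(i)}$ the predicted class and $c^{(i)}$ the true class of the $i$-th sample), this training criterion reads

$$\frac{1}{N} \sum_{i=1}^{N} I\left(\hat{c}^{(i)} \neq c^{(i)}\right),$$

where $I(\cdot)$ is 1 if its argument is true and 0 otherwise.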
However, it turns out that finding a linear classifier that minimizes the misclassification rate is an intractable problem, i.e. its computational complexity is exponential in the number of input dimensions, rendering it impractical. Moreover, even if we have found a classifier that minimizes the misclassification rate on the training samples, it might still be possible to make the classifier more robust to unseen samples by pushing the classes further apart, even if this does not reduce the misclassification rate on the training samples.
Maximum likelihood estimation
An alternative is to use maximum likelihood estimation, where we try to find the parameters that maximize the probability of the training data:
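Writing $p$ for the $N \times C$ matrix of predicted class probabilities ($p_{ij}$ is the probability assigned to sample $i$ belonging to class $j$) and $c$ for the corresponding matrix of one-hot true labels, as in the code below, maximizing the probability of the training data is equivalent to minimizing the negative log-likelihood

$$J = - \sum_{i=1}^{N} \sum_{j=1}^{C} c_{ij} \log p_{ij}.$$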
We refer to $J$ as the loss, and we want to find the parameters that minimize it.
We can regard $J$ as just another operation in our computational graph, taking the predicted probabilities and the true labels as its inputs.
Building an operation that computes J
We can build up an operation that computes $J$ out of a handful of elementary operations.
Going from the inside out, we can see that we need to implement the following operations:
- $\log$: the element-wise logarithm of a matrix or vector
- $\odot$: the element-wise product of two matrices
- $\sum_{j=1}^{C}$: the sum over the columns of a matrix
- $\sum_{i=1}^{N}$: the sum over the rows of a matrix
- $-$: taking the negative
Let's implement these operations.
log
This computes the element-wise logarithm of a tensor.
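As a rough sketch of what this could look like, assuming an Operation base class along the lines of the one introduced earlier in this series (a minimal stand-in is included below so the snippet runs on its own; its compute method receives one numerical value per input node):

```python
import numpy as np


class Operation:
    """Minimal stand-in for the Operation base class introduced earlier
    in this series: it only records its input nodes, so that the
    snippets below run on their own."""

    def __init__(self, input_nodes=None):
        self.input_nodes = input_nodes or []


class log(Operation):
    """Computes the element-wise logarithm of x."""

    def __init__(self, x):
        # x: the node whose output we want to take the logarithm of
        super().__init__([x])

    def compute(self, x_value):
        # x_value: the numerical value (numpy array) of the input node
        return np.log(x_value)
```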
multiply / ⊙
This computes the element-wise product of two tensors of the same shape.
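A corresponding sketch, reusing the Operation stand-in from the log snippet above:

```python
class multiply(Operation):
    """Computes the element-wise product x * y."""

    def __init__(self, x, y):
        # x, y: nodes producing tensors of the same shape
        super().__init__([x, y])

    def compute(self, x_value, y_value):
        # Element-wise (Hadamard) product of the two input values
        return x_value * y_value
```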
reduce_sum
We'll implement the summation over rows, columns, etc. in a single operation where we specify an axis. This way, we can use the same method for all types of summations. For example, axis = 0 sums over the rows, axis = 1 sums over the columns, etc. This is exactly what numpy.sum does.
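A sketch along the same lines, simply delegating the axis handling to numpy.sum (again building on the Operation stand-in from the log snippet):

```python
import numpy as np


class reduce_sum(Operation):
    """Sums the elements of A along the given axis."""

    def __init__(self, A, axis=None):
        # axis=None sums over all elements, axis=0 sums over the rows,
        # axis=1 sums over the columns -- mirroring numpy.sum
        super().__init__([A])
        self.axis = axis

    def compute(self, A_value):
        return np.sum(A_value, axis=self.axis)
```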
negative
This computes the element-wise negative of a tensor.
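And a matching sketch for the negation, again on top of the same Operation stand-in:

```python
class negative(Operation):
    """Computes the element-wise negative of x."""

    def __init__(self, x):
        super().__init__([x])

    def compute(self, x_value):
        # Negate every entry of the input value
        return -x_value
```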
Putting it all together
Using these operations, we can now compute $J$ as follows:
J = negative(reduce_sum(reduce_sum(multiply(c, log(p)), axis=1)))
Example
Let's now compute the loss of our red/blue perceptron.
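As an illustration of what this computation boils down to, here is the formula from above evaluated with plain numpy on made-up probabilities and labels (the numbers are hypothetical and not the actual output of the red/blue perceptron; a full version would feed the real training data through the computational graph instead):

```python
import numpy as np

# Hypothetical predicted probabilities for four points (rows) over the
# two classes red and blue (columns) -- made-up numbers, not the actual
# output of the perceptron from the earlier section.
p_value = np.array([
    [0.9, 0.1],   # confidently red
    [0.8, 0.2],   # fairly red
    [0.3, 0.7],   # fairly blue
    [0.1, 0.9],   # confidently blue
])

# One-hot true labels: the first two points are red, the last two blue
c_value = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
])

# J = negative(reduce_sum(reduce_sum(multiply(c, log(p)), axis=1))),
# evaluated here directly with numpy:
J_value = -np.sum(np.sum(c_value * np.log(p_value), axis=1))
print(J_value)  # sum of -log(probability assigned to the true class)
```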