Perceptrons are a miniature form of neural network and a basic building block of more complex architectures. Before going into the details, let's motivate them by an example. Assume that we are given a dataset consisting of 100 points in the plane. Half of the points are red and half of the points are blue.
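As a side note, such a dataset can be generated and plotted with a few lines of NumPy and matplotlib. The exact recipe below (normally distributed clusters around two centers) is an assumption, but it matches the description above and defines the blue_points and red_points arrays that are used again later in this section.

import numpy as np
import matplotlib.pyplot as plt

# 50 blue points centered at (2, 2) and 50 red points centered at (-2, -2)
# (assumed generation recipe; any two well-separated clusters would do)
blue_points = np.random.randn(50, 2) + np.array([2, 2])
red_points = np.random.randn(50, 2) + np.array([-2, -2])

# Scatter plot of the two classes
plt.scatter(blue_points[:, 0], blue_points[:, 1], color='blue')
plt.scatter(red_points[:, 0], red_points[:, 1], color='red')
plt.show()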
As we can see, the red points are centered at $(-2, -2)$ and the blue points are centered at $(2, 2)$. Now, having seen this data, we can ask ourselves whether there is a way to determine whether a point should be red or blue. For example, if someone asks us what the color of the point $(3, 2)$ should be, we'd best respond with blue. Even though this point was not part of the data we have seen, we can infer this since it is located in the blue region of the space.
But what is the general rule to determine whether a point is more likely to be blue than red? As it turns out, we can draw the line $y = -x$, which neatly separates the space into a red region and a blue region:
Example
# Plot a line y = -x
x_axis = np.linspace(-4, 4, 100)
y_axis = -x_axis
plt.plot(x_axis, y_axis)
We can implicitly represent this line using a weight vector $w$ and a bias $b$. The line then corresponds to the set of points $x$ where
$$w^T x + b = 0$$
In the case above, we have $w = (1, 1)^T$ and $b = 0$. Now, in order to test whether a point is blue or red, we just have to check whether it lies above or below the line. This can be achieved by checking the sign of $w^T x + b$: if it is positive, then $x$ is above the line; if it is negative, then $x$ is below the line. Let's perform this test for our example point $(3, 2)^T$:
$$\begin{pmatrix} 1 & 1 \end{pmatrix} \cdot \begin{pmatrix} 3 \\ 2 \end{pmatrix} = 5$$
Since 5 > 0, we know that the point is above the line and, therefore, should be classified as blue.
Perceptron definition
In general terms, a classifier is a function $\hat{c} : \mathbb{R}^d \to \{1, 2, \dots, C\}$ that maps a point onto one of $C$ classes. A binary classifier is a classifier where $C = 2$, i.e. we have two classes. A perceptron with weight vector $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$ is a binary classifier where
$$\hat{c}(x) = \begin{cases} 1, & \text{if } w^T x + b \geq 0 \\ 2, & \text{if } w^T x + b < 0 \end{cases}$$
$\hat{c}$ partitions $\mathbb{R}^d$ into two half-spaces, each corresponding to one of the two classes. In the 2-dimensional example above, the partitioning is along a line. In general, the partitioning is along a $(d-1)$-dimensional hyperplane.
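As a quick illustration, here is a plain-NumPy version of this decision rule. The function name perceptron_classify is made up for this example; it simply evaluates the sign of $w^T x + b$ as in the definition above.

def perceptron_classify(x, w, b):
    """Return class 1 if w^T x + b >= 0, else class 2."""
    return 1 if np.dot(w, x) + b >= 0 else 2

# The example point (3, 2) with w = (1, 1) and b = 0 lands in class 1 (blue)
print(perceptron_classify(np.array([3, 2]), np.array([1, 1]), 0))  # prints 1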
From classes to probabilities
Depending on the application, we may be interested not only in the most likely class of a point, but also in the probability with which it belongs to that class. Note that the higher the value of $w^T x + b$, the greater the point's distance to the separating line and, therefore, the greater our confidence that it belongs to the blue class. But this value can be arbitrarily high. In order to turn it into a probability, we need to "squash" it to lie between 0 and 1. One way to do this is to apply the sigmoid function $\sigma$:
$$p(\hat{c}(x) = 1 \mid x) = \sigma(w^T x + b) \quad \text{where} \quad \sigma(a) = \frac{1}{1 + e^{-a}}$$
Let's take a look at what the sigmoid function looks like:
Sigmoid plot
# Create an interval from -5 to 5 in steps of 0.01
a = np.arange(-5, 5, 0.01)
# Compute corresponding sigmoid function values
s = 1 / (1 + np.exp(-a))
# Plot them
plt.plot(a, s)
plt.grid(True)
plt.show()
As we can see, the sigmoid function assigns a probability of 0.5 to values where $w^T x + b = 0$ (i.e. points on the line), asymptotes towards 1 as $w^T x + b$ grows, and asymptotes towards 0 as it shrinks, which is exactly what we want.
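For our example point $(3, 2)^T$ we computed $w^T x + b = 5$, so the perceptron's confidence that the point is blue is $\sigma(5) = 1 / (1 + e^{-5}) \approx 0.993$. A one-line check in NumPy:

# Sigmoid of the score 5 computed for the point (3, 2)
print(1 / (1 + np.exp(-5)))  # roughly 0.9933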
Let's now define the sigmoid function as an operation, since we'll need it later:
Sigmoid
class sigmoid(Operation):
    """Returns the sigmoid of x element-wise.
    """

    def __init__(self, a):
        """Construct sigmoid

        Args:
          a: Input node
        """
        super().__init__([a])

    def compute(self, a_value):
        """Compute the output of the sigmoid operation

        Args:
          a_value: Input value
        """
        return 1 / (1 + np.exp(-a_value))
The entire computational graph of the perceptron thus consists of a matmul of $w$ and $x$, followed by an add of $b$, followed by a sigmoid.
Example
Using what we have learned, we can now build a perceptron for the red/blue example in Python.
Building a Perceptron
# Create a new graph
Graph().as_default()
x = placeholder()
w = Variable([1, 1])
b = Variable(0)
p = sigmoid( add(matmul(w, x), b) )
Let's use this perceptron to compute the probability that $(3, 2)^T$ is a blue point:
Using the Perceptron
session = Session()
print(session.run(p, {
    x: [3, 2]
}))
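Assuming the graph operations behave like their plain NumPy counterparts, this should print a value of roughly 0.993, matching the sigmoid computation for $\sigma(5)$ above.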
Multi-class perceptron
So far, we have used the perceptron as a binary classifier, telling us the probability $p$ that a point $x$ belongs to one of two classes. The probability of $x$ belonging to the respective other class is then given by $1 - p$. Generally, however, we have more than two classes. For example, when classifying an image, there may be numerous output classes (dog, chair, human, house, ...). We can extend the perceptron to compute multiple output probabilities.
Let $C$ denote the number of output classes. Instead of a weight vector $w$, we introduce a weight matrix $W \in \mathbb{R}^{d \times C}$. Each column of the weight matrix contains the weights of a separate linear classifier, one for each class. Instead of the dot product $w^T x$, we compute $xW$ (treating $x$ as a $1 \times d$ row vector), which returns a vector in $\mathbb{R}^C$, each of whose entries can be seen as the output of the dot product for a different column of the weight matrix. To this, we add a bias vector $b \in \mathbb{R}^C$, containing a distinct bias for each output class. The result is a vector in $\mathbb{R}^C$ of scores, one per class, which we will turn into probabilities in a moment.
While this procedure may seem complicated, the matrix multiplication actually just performs multiple linear classifications in parallel, one for each of the $C$ classes, each with its own separating line given by a weight vector (one column of $W$) and a bias (one entry of $b$).
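To make this concrete, here is a small standalone NumPy sketch (not part of the graph code) that evaluates the two per-class scores for our example point, using the weight columns $(1, 1)$ for blue and $(-1, -1)$ for red:

# Columns of W: one linear classifier per class (blue, red)
W = np.array([[1, -1],
              [1, -1]])
b = np.array([0, 0])
x = np.array([3, 2])

scores = x @ W + b  # one score per class
print(scores)       # [ 5 -5 ]: high score for blue, low score for red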
Softmax
While the original perceptron yielded a single scalar value that we squashed through a sigmoid to obtain a probability between 0 and 1, the multi-class perceptron yields a vector $a \in \mathbb{R}^C$. The higher the $i$-th entry of $a$, the more confident we are that the input point belongs to the $i$-th class. We would like to turn $a$ into a vector of probabilities, such that the probability for every class lies between 0 and 1 and the probabilities for all classes sum to 1.
A common way to do this is to use the softmax function, which is a generalization of the sigmoid to multiple output classes:

$$\sigma(a)_i = \frac{e^{a_i}}{\sum_{j=1}^{C} e^{a_j}}$$
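The example code below calls a softmax operation on the graph, whose definition is not shown in this section. A sketch mirroring the sigmoid class above, assuming the same Operation base class and that inputs arrive as single points or batches of row vectors, might look like this:

class softmax(Operation):
    """Returns the softmax of a, computed over the last axis.
    """

    def __init__(self, a):
        """Construct softmax

        Args:
          a: Input node
        """
        super().__init__([a])

    def compute(self, a_value):
        """Compute the output of the softmax operation

        Args:
          a_value: Input value
        """
        # Subtract the max for numerical stability (does not change the result)
        shifted = a_value - np.max(a_value, axis=-1, keepdims=True)
        exp = np.exp(shifted)
        return exp / np.sum(exp, axis=-1, keepdims=True)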
Batches
The matrix form also allows us to feed in more than one point at a time. That is, instead of a single point $x$, we can feed in a matrix $X \in \mathbb{R}^{N \times d}$ containing one point per row (i.e. $N$ rows of $d$-dimensional points). We refer to such a matrix as a batch. Instead of $xW$, we compute $XW$, which returns an $N \times C$ matrix, each of whose rows contains $xW$ for one point $x$. To each row, we add the bias vector $b$, which is now a $1 \times C$ row vector. The whole procedure thus computes a function $f : \mathbb{R}^{N \times d} \to \mathbb{R}^{N \times C}$ where $f(X) = \sigma(XW + b)$ and $\sigma$ is the softmax applied to each row. The computational graph is the same as before, with $x$ and $w$ replaced by $X$ and $W$ and the sigmoid replaced by a softmax.
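As a quick shape check (a hedged sketch reusing the NumPy W and b from the earlier sketch): feeding a batch of $N$ points through $XW + b$ and a row-wise softmax yields an $N \times C$ matrix of probabilities whose rows each sum to 1.

# Stack a few 2-D points into a batch X of shape (N, d)
X = np.array([[3, 2], [-1, -2], [0, 1]])

scores = X @ W + b  # shape (N, C): one row of class scores per point
probs = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)

print(probs.shape)        # (3, 2)
print(probs.sum(axis=1))  # each row sums to 1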
Example
Let's now generalize our red/blue perceptron to allow for batch computation and multiple output classes.
Multi-class Perceptron
# Create a new graph
Graph().as_default()

X = placeholder()

# Create a weight matrix for 2 output classes:
# One with a weight vector (1, 1) for blue and one with a weight vector (-1, -1) for red
W = Variable([
    [1, -1],
    [1, -1]
])
b = Variable([0, 0])
p = softmax( add(matmul(X, W), b) )

# Create a session and run the perceptron on our blue/red points
session = Session()
output_probabilities = session.run(p, {
    X: np.concatenate((blue_points, red_points))
})

# Print the first 10 lines, corresponding to the probabilities of the first 10 points
print(output_probabilities[:10])
Since the first 10 points in our data are all blue, the perceptron outputs high probabilities for blue (left column) and low probabilities for red (right column), as expected.
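If we want hard class decisions rather than probabilities, we can take the argmax over each row. The following sketch assumes the output_probabilities array from the run above (a NumPy array with the 50 blue points first, then the 50 red points, as in the concatenation) and checks how many of the 100 points the multi-class perceptron classifies correctly:

# Predicted class per point: 0 = blue (left column), 1 = red (right column)
predictions = np.argmax(output_probabilities, axis=1)

# The first 50 points are blue (class 0), the last 50 are red (class 1)
true_classes = np.concatenate((np.zeros(50), np.ones(50)))

accuracy = np.mean(predictions == true_classes)
print(accuracy)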