Neural Network xor example from scratch (no libs)

14.8K views

Open Source Your Knowledge, Become a Contributor

Technology knowledge has to be shared and made accessible for free. Join the movement.

Introduction

XOR example is a toy problem in machine learning community, a hello world for introducing neural networks. It means you have to build and train the neural network so that given 2 inputs it will output what a XOR function would output (at least close to it).

This isn't math heavy explanatory tutorial, there are plenty of them out there. Just google "neural network xor example", and you'll get for example 1 and 2 with detailed explanation. I assume you have some vague knowledge of neural networks and try to write a simple one. This article is just a bunch of simple python scripts that implement neural networks. No numpy or other libraries are used, so they should be easily translatable to other languages.

Codingame doesn't have NN specific libraries. For Codingame people mostly train locally using NN libraries, and just copy the learned weights and implement the forward propagation themselves in their language.

All the scripts use stochastic gradient descent to train the neural network, one data row at a time, so no need for matrix tranpositions, which would be required for mini-batch. The loss function is mean squared error.

First script

This is the simplest script, an implementation of

Here the neural network is just a bunch of loosely written variables.

import random
import math
VARIANCE_W = 0.5
w11 = random.uniform(-VARIANCE_W,VARIANCE_W)
w21 = random.uniform(-VARIANCE_W,VARIANCE_W)
b1 = 0
w12 = random.uniform(-VARIANCE_W,VARIANCE_W)
w22 = random.uniform(-VARIANCE_W,VARIANCE_W)
b2 = 0
w13 = random.uniform(-VARIANCE_W,VARIANCE_W)
w23 = random.uniform(-VARIANCE_W,VARIANCE_W)
b3 = 0
o1 = random.uniform(-VARIANCE_W,VARIANCE_W)
o2 = random.uniform(-VARIANCE_W,VARIANCE_W)
o3 = random.uniform(-VARIANCE_W,VARIANCE_W)
ob = 0
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))
def sigmoid_prime(x): # x already sigmoided
    return x * (1 - x)
def predict(i1,i2):    
    s1 = w11 * i1 + w21 * i2 + b1
    s1 = sigmoid(s1)
    s2 = w12 * i1 + w22 * i2 + b2
    s2 = sigmoid(s2)
    s3 = w13 * i1 + w23 * i2 + b3
    s3 = sigmoid(s3)
    
    output = s1 * o1 + s2 * o2 + s3 * o3 + ob
    output = sigmoid(output)
    
    return output
def learn(i1,i2,target, alpha=0.2):
    global w11,w21,b1,w12,w22,b2,w13,w23,b3
    global o1,o2,o3,ob
    
    s1 = w11 * i1 + w21 * i2 + b1
    s1 = sigmoid(s1)
    s2 = w12 * i1 + w22 * i2 + b2

Example output:

epoch 1000 mean squared error: 0.24950644703013328
epoch 2000 mean squared error: 0.21033216889786982
epoch 3000 mean squared error: 0.06970183505211733
epoch 4000 mean squared error: 0.010193355003568688
epoch 5000 mean squared error: 0.004621174723851184
epoch 6000 mean squared error: 0.002889802406873843
epoch 7000 mean squared error: 0.0020756513163735827
epoch 8000 mean squared error: 0.001609116960177383
epoch 9000 mean squared error: 0.0013089350608506563
epoch 10000 mean squared error: 0.0011005064043563537
for input [0, 0] expected 0 predicted 0.02906 which is correct
for input [0, 1] expected 1 predicted 0.9684 which is correct
for input [1, 0] expected 1 predicted 0.9684 which is correct
for input [1, 1] expected 0 predicted 0.03951 which is correct

Your mileage may vary. Sometimes this simple net will diverge and output for all inputs the 0.666..., or it would need more iterations to train. It's normal as it is more sensitive to starting random weights than more complex models. NN libraries suffer from that too, but they can mitigate it by smarter weights initialization (1, 2, 3).

You can play around with learning rate (alpha) and see how it affects the speed of learning. It is one of the most important hyperparameters in machine learning world. For this example, the orders of around 0.1 is used. In real world applications, 0.001, 0.0001 or even less are used, along with some decay rate. See 1, 2, 3 for details.

Second script

This one is more flexible. The variables are stored in array and the for loops more resemble the matrix operations. With HIDDEN = 3, it behaves the same as the first script.

import random
import math
VARIANCE_W = 0.5
INPUTS = 2
HIDDEN = 3
OUTPUTS = 1
hidden_weights = []
for _ in range(HIDDEN):
    hidden_weights.append([random.uniform(-VARIANCE_W,VARIANCE_W) for _ in range(INPUTS)])
hidden_bias = [0] * HIDDEN
output_weights = []
for _ in range(OUTPUTS):
    output_weights.append([random.uniform(-VARIANCE_W,VARIANCE_W) for _ in range(HIDDEN)])
    
output_bias = [0] * OUTPUTS
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))
def sigmoid_prime(x): # x already sigmoided
    return x * (1 - x)
def predict(inputs):
    hiddens = []
    for i in range(HIDDEN):
        hidden = 0
        for j in range(INPUTS):
            hidden += hidden_weights[i][j] * inputs[j]
        hidden = sigmoid(hidden + hidden_bias[i])
        hiddens.append(hidden)
    
    outputs = []
    for i in range(OUTPUTS):
        output = 0
        for j in range(HIDDEN):
            output += output_weights[i][j] * hiddens[j]
        output = sigmoid(output + output_bias[i])
        outputs.append(output)
    
    return output
def learn(inputs,targets,alpha=0.1):

Example output:

1000 mean squared error: 0.25001130384302916
2000 mean squared error: 0.24998178999223236
3000 mean squared error: 0.2498882299395278
4000 mean squared error: 0.2487285310117196
5000 mean squared error: 0.19301651904542888
6000 mean squared error: 0.06884957172680736
7000 mean squared error: 0.011780981900514784
8000 mean squared error: 0.005242561964051928
9000 mean squared error: 0.0032150923746817523
10000 mean squared error: 0.0022766661560815666
for input [0, 0] expected 0 predicted 0.02494 which is correct
for input [0, 1] expected 1 predicted 0.9528 which is correct
for input [1, 0] expected 1 predicted 0.9563 which is correct
for input [1, 1] expected 0 predicted 0.06596 which is correct

You can play around with the number of hidden units.

Third script

This one is exactly as second script, except the activation function for hidden layer is tanh, instead of sigmoid. Also, sigmoid_prime is replaced by tanh_prime in hidden layer during training.

import random
import math
VARIANCE_W = 0.5
INPUTS = 2
HIDDEN = 3
OUTPUTS = 1
hidden_weights = []
for _ in range(HIDDEN):
    hidden_weights.append([random.uniform(-VARIANCE_W,VARIANCE_W) for _ in range(INPUTS)])
hidden_bias = [0] * HIDDEN
output_weights = []
for _ in range(OUTPUTS):
    output_weights.append([random.uniform(-VARIANCE_W,VARIANCE_W) for _ in range(HIDDEN)])
    
output_bias = [0] * OUTPUTS
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))
def sigmoid_prime(x): # x already sigmoided
    return x * (1 - x)
def tanh(x):
    return math.tanh(x)
def tanh_prime(x): # x already tanhed
    return 1 - x * x
def predict(inputs):
    hiddens = []
    for i in range(HIDDEN):
        hidden = 0
        for j in range(INPUTS):
            hidden += hidden_weights[i][j] * inputs[j]
        hidden = tanh(hidden + hidden_bias[i])
        hiddens.append(hidden)
    
    outputs = []
    for i in range(OUTPUTS):
        output = 0
        for j in range(HIDDEN):

Example output:

1000 mean squared error: 0.0061729073060459915
2000 mean squared error: 0.0015754154512582549
3000 mean squared error: 0.0008510341785755552
4000 mean squared error: 0.0005731795112170564
5000 mean squared error: 0.00042899519927730044
6000 mean squared error: 0.0003414967710025598
7000 mean squared error: 0.0002830319449863964
8000 mean squared error: 0.00024133098714384138
9000 mean squared error: 0.00021014923444340877
10000 mean squared error: 0.0001859852279128196
for input [0, 0] expected 0 predicted 0.01346 which is correct
for input [0, 1] expected 1 predicted 0.9888 which is correct
for input [1, 0] expected 1 predicted 0.9862 which is correct
for input [1, 1] expected 0 predicted 0.01573 which is correct

See how faster and better it converges than only sigmoid one. tanh graph

The common choice for activation function is relu. Has plenty advantages, but for me the most important one is speed. In game tree search, doing bunch of relus is much faster than doing bunch of tanhs.

For exercise, replace tanh and tanh_prime with relu and relu_prime accordingly and see how does it affect training. I remark that relu seems more sensitive to weights initialization, in the simplest xor example it diverges more often.

def relu(x):
    return max(0,x)


def relu_prime(x):
    if x <= 0: return 0
    else: return 1

Fourth script

The same as third script (tanh in hidden layer), but this time with momentum 1, 2.

Let's add momentum next to the weights. Here instead of updating weights directly, we update the momentum and then the weights according to the momentum.

import random
import math
VARIANCE_W = 0.5
INPUTS = 2
HIDDEN = 3
OUTPUTS = 1
hidden_weights = []
hidden_momentum = []
for _ in range(HIDDEN):
    hidden_weights.append([random.uniform(-VARIANCE_W,VARIANCE_W) for _ in range(INPUTS)])
    hidden_momentum.append([0 for _ in range(INPUTS)])
hidden_bias = [0] * HIDDEN
output_weights = []
output_momentum = []
for _ in range(OUTPUTS):
    output_weights.append([random.uniform(-VARIANCE_W,VARIANCE_W) for _ in range(HIDDEN)])
    output_momentum.append([0 for _ in range(HIDDEN)])
    
output_bias = [0] * OUTPUTS
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))
def sigmoid_prime(x): # x already sigmoided
    return x * (1 - x)
def tanh(x):
    return math.tanh(x)
def tanh_prime(x): # x already tanhed
    return 1 - x * x
def predict(inputs):
    hiddens = []
    for i in range(HIDDEN):
        hidden = 0
        for j in range(INPUTS):
            hidden += hidden_weights[i][j] * inputs[j]
        hidden = tanh(hidden + hidden_bias[i])
        hiddens.append(hidden)
    

Example output:

1000 mean squared error: 0.00044142174881360634
2000 mean squared error: 0.00020205770098357436
3000 mean squared error: 0.00013037626876152424
4000 mean squared error: 9.606031869410035e-05
5000 mean squared error: 7.598012945270696e-05
6000 mean squared error: 6.281017989283364e-05
7000 mean squared error: 5.351418734233088e-05
8000 mean squared error: 4.660449366354482e-05
9000 mean squared error: 4.1269082890237664e-05
10000 mean squared error: 3.702491046845857e-05
for input [0, 0] expected 0 predicted 0.002033 which is correct
for input [0, 1] expected 1 predicted 0.9932 which is correct
for input [1, 0] expected 1 predicted 0.9932 which is correct
for input [1, 1] expected 0 predicted 0.007203 which is correct

New hyperparamater has been introduced, commonly called lambda. It says how much the past updates should affect the current update. You can play with it, the usual values are within [0.8,0.99] range. It is very important hyperparameter, next to learning rate.

See how much faster does it converge than the version without momentum. Note that X is 1000 iterations instead of 10000. momentum graph

Stochastic gradient descent with momentum is a standard baseline training algorithm. There are various optimizations proposed with Adam being the preferred choice. Using some formula they update individual momentum, learning rates and possibly other things. Depending on the problem they may converge much faster than standard SGD.

Summary

I wrote this article because in the beginning I struggled to understand how to train neural networks. Most tutorials introduce heavy math, then use python libs with few lines and call it a day.

import nnlib
import mnist

nn = nnlib.NeuralNetwork()
nn.learn(mnist)
nn.predict(mnist)
# yay we have trained our first neural network completely by ourselves!

I wouldn't learn much from that. I am dumb as the NNs I create, I learn by examples. I need many many similar, slightly increasing in difficulty examples to grasp the new concept. Of course the neural network world is much bigger than that, but hopefully the scripts I provided will give you some insight into NNs and encourage you to learn more about them.

Then after grasping the basic concepts, I would use the libraries. Most common are pytorch and keras/tensorflow. No need to implement Adam or other optimization, smarter weights initialization, regularization etc. And it's easy to prototype and test various NN architectures (multiple hiddden layers, different units) to see which one is the best.

I'd like to thank Wontomino for providing some feedback on this article.

Open Source Your Knowledge: become a Contributor and help others learn. Create New Content

Open Source Your Knowledge, Become a Contributor

Introduction

First script

Second script

Third script

Fourth script

Summary

PYTHON: BEGINNER QUIZ (10 Questions)

Simple Python Test

9a7ba

Python from Zero to Hero