Neural Network XOR example from scratch (no libs)

jacek

Introduction

The XOR example is a toy problem in the machine learning community, a "hello world" for introducing neural networks. The task is to build and train a neural network so that, given two inputs, it outputs what the XOR function would output (or at least something close to it).

This isn't a math-heavy explanatory tutorial; there are plenty of those out there. Just google "neural network xor example" and you'll find, for example, 1 and 2 with detailed explanations. I assume you have some vague knowledge of neural networks and want to try writing a simple one. This article is just a bunch of simple Python scripts that implement neural networks. No numpy or other libraries are used, so they should be easily translatable to other languages.

Codingame doesn't have NN-specific libraries. For Codingame, people mostly train locally using NN libraries, then copy the learned weights and implement the forward propagation themselves in their language.
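For illustration, a hand-rolled forward pass with copied weights only takes a few lines of Python. This is a minimal sketch: the weight values, layer sizes and names below are placeholders I made up, not trained weights.

import math

# hypothetical weights "copied" from an offline training run (placeholder values, not trained)
W_HIDDEN = [[0.5, -0.3], [0.8, 0.1]]   # one row of input weights per hidden unit
B_HIDDEN = [0.0, 0.2]
W_OUTPUT = [1.2, -0.7]
B_OUTPUT = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs):
    # hidden layer: weighted sum of inputs plus bias, squashed by sigmoid
    hidden = [sigmoid(sum(w * i for w, i in zip(row, inputs)) + b)
              for row, b in zip(W_HIDDEN, B_HIDDEN)]
    # output layer: weighted sum of hidden activations plus bias
    return sigmoid(sum(w * h for w, h in zip(W_OUTPUT, hidden)) + B_OUTPUT)

print(forward([1, 0]))   # some number between 0 and 1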

All the scripts use stochastic gradient descent to train the neural network, one data row at a time, so there is no need for the matrix transpositions that mini-batch training would require. The loss function is mean squared error.

First script

This is the simplest script, a quick throwaway implementation of the network.

Here the neural network is just a bunch of loosely declared variables.
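The script itself is embedded in the playground, so here is a minimal stand-in sketch in that style: every weight and bias is its own variable, trained with per-row SGD on the four XOR patterns. To keep it short I use two hidden units and an arbitrary learning rate of 0.5; judging by the second script's HIDDEN = 3, the playground version likely uses three hidden units, but the structure is the same.

import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# the four XOR rows: inputs and expected output
DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

ALPHA = 0.5  # learning rate, an assumed value

# every weight and bias is its own variable, initialized randomly
w11, w12, b1 = (random.uniform(-1, 1) for _ in range(3))  # hidden unit 1
w21, w22, b2 = (random.uniform(-1, 1) for _ in range(3))  # hidden unit 2
v1, v2, c = (random.uniform(-1, 1) for _ in range(3))     # output unit

for epoch in range(1, 10001):
    error = 0.0
    for (x1, x2), target in DATA:
        # forward pass
        h1 = sigmoid(w11 * x1 + w12 * x2 + b1)
        h2 = sigmoid(w21 * x1 + w22 * x2 + b2)
        out = sigmoid(v1 * h1 + v2 * h2 + c)
        error += (out - target) ** 2

        # backward pass: error signal times sigmoid derivative y*(1-y)
        d_out = (out - target) * out * (1.0 - out)
        d_h1 = d_out * v1 * h1 * (1.0 - h1)
        d_h2 = d_out * v2 * h2 * (1.0 - h2)

        # stochastic gradient descent update, one data row at a time
        v1 -= ALPHA * d_out * h1
        v2 -= ALPHA * d_out * h2
        c -= ALPHA * d_out
        w11 -= ALPHA * d_h1 * x1
        w12 -= ALPHA * d_h1 * x2
        b1 -= ALPHA * d_h1
        w21 -= ALPHA * d_h2 * x1
        w22 -= ALPHA * d_h2 * x2
        b2 -= ALPHA * d_h2

    if epoch % 1000 == 0:
        print("epoch", epoch, "mean squared error:", error / len(DATA))

for (x1, x2), target in DATA:
    h1 = sigmoid(w11 * x1 + w12 * x2 + b1)
    h2 = sigmoid(w21 * x1 + w22 * x2 + b2)
    out = sigmoid(v1 * h1 + v2 * h2 + c)
    print("for input", [x1, x2], "expected", target, "predicted", round(out, 4),
          "which is", "correct" if round(out) == target else "incorrect")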

Example output:

epoch 1000 mean squared error: 0.24950644703013328
epoch 2000 mean squared error: 0.21033216889786982
epoch 3000 mean squared error: 0.06970183505211733
epoch 4000 mean squared error: 0.010193355003568688
epoch 5000 mean squared error: 0.004621174723851184
epoch 6000 mean squared error: 0.002889802406873843
epoch 7000 mean squared error: 0.0020756513163735827
epoch 8000 mean squared error: 0.001609116960177383
epoch 9000 mean squared error: 0.0013089350608506563
epoch 10000 mean squared error: 0.0011005064043563537
for input [0, 0] expected 0 predicted 0.02906 which is correct
for input [0, 1] expected 1 predicted 0.9684 which is correct
for input [1, 0] expected 1 predicted 0.9684 which is correct
for input [1, 1] expected 0 predicted 0.03951 which is correct

Your mileage may vary. Sometimes this simple net will diverge and output roughly 0.666... for all inputs, or it will need more iterations to train. This is normal, as it is more sensitive to the starting random weights than more complex models are. NN libraries suffer from this too, but they mitigate it with smarter weight initialization (1, 2, 3).
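As a reference point, one such scheme, Glorot/Xavier uniform initialization, can be sketched in plain Python like this (the helper name and list layout are my own; the exact formula each library uses differs):

import math
import random

def xavier_uniform(fan_in, fan_out):
    # Glorot/Xavier uniform init: sample weights uniformly in [-limit, limit]
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[random.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

w_hidden = xavier_uniform(2, 3)   # 3 hidden units, 2 inputs each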

You can play around with the learning rate (alpha) and see how it affects the speed of learning. It is one of the most important hyperparameters in machine learning. For this example, a value on the order of 0.1 is used. In real-world applications, 0.001, 0.0001 or even less is used, often together with some decay schedule; a tiny sketch of such a schedule follows. See 1, 2, 3 for details.
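A decay schedule can be as simple as shrinking alpha every so often. In this sketch the 0.95 factor and the 1000-epoch interval are arbitrary choices, not values from the scripts:

alpha = 0.1    # starting learning rate
DECAY = 0.95   # arbitrary decay factor for this sketch

for epoch in range(1, 10001):
    # ... one training pass over the data using the current alpha ...
    if epoch % 1000 == 0:
        alpha *= DECAY   # gradually take smaller steps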

Second script

This one is more flexible. The variables are stored in arrays, and the for loops more closely resemble matrix operations. With HIDDEN = 3, it behaves the same as the first script.
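Again, the actual script is in the playground; a rough sketch of the same idea, with the weights kept in lists and the forward and backward passes written as loops, might look like this (the names, the 0.5 learning rate and the initialization range are my assumptions):

import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(y):
    # derivative of sigmoid written in terms of its output y = sigmoid(x)
    return y * (1.0 - y)

DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
INPUTS = 2
HIDDEN = 3
ALPHA = 0.5  # assumed learning rate

# w_hidden[j][i] connects input i to hidden unit j; one bias per hidden unit
w_hidden = [[random.uniform(-1, 1) for _ in range(INPUTS)] for _ in range(HIDDEN)]
b_hidden = [random.uniform(-1, 1) for _ in range(HIDDEN)]
w_out = [random.uniform(-1, 1) for _ in range(HIDDEN)]
b_out = random.uniform(-1, 1)

def forward(x):
    hidden = [sigmoid(sum(w_hidden[j][i] * x[i] for i in range(INPUTS)) + b_hidden[j])
              for j in range(HIDDEN)]
    out = sigmoid(sum(w_out[j] * hidden[j] for j in range(HIDDEN)) + b_out)
    return hidden, out

for epoch in range(1, 10001):
    error = 0.0
    for x, target in DATA:
        hidden, out = forward(x)
        error += (out - target) ** 2

        # backward pass
        d_out = (out - target) * sigmoid_prime(out)
        d_hidden = [d_out * w_out[j] * sigmoid_prime(hidden[j]) for j in range(HIDDEN)]

        # stochastic gradient descent update, one data row at a time
        for j in range(HIDDEN):
            w_out[j] -= ALPHA * d_out * hidden[j]
            b_hidden[j] -= ALPHA * d_hidden[j]
            for i in range(INPUTS):
                w_hidden[j][i] -= ALPHA * d_hidden[j] * x[i]
        b_out -= ALPHA * d_out

    if epoch % 1000 == 0:
        print(epoch, "mean squared error:", error / len(DATA))

for x, target in DATA:
    out = forward(x)[1]
    print("for input", x, "expected", target, "predicted", round(out, 4),
          "which is", "correct" if round(out) == target else "incorrect")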

Example output:

1000 mean squared error: 0.25001130384302916
2000 mean squared error: 0.24998178999223236
3000 mean squared error: 0.2498882299395278
4000 mean squared error: 0.2487285310117196
5000 mean squared error: 0.19301651904542888
6000 mean squared error: 0.06884957172680736
7000 mean squared error: 0.011780981900514784
8000 mean squared error: 0.005242561964051928
9000 mean squared error: 0.0032150923746817523
10000 mean squared error: 0.0022766661560815666
for input [0, 0] expected 0 predicted 0.02494 which is correct
for input [0, 1] expected 1 predicted 0.9528 which is correct
for input [1, 0] expected 1 predicted 0.9563 which is correct
for input [1, 1] expected 0 predicted 0.06596 which is correct

You can play around with the number of hidden units.

Third script

This one is exactly like the second script, except that the activation function for the hidden layer is tanh instead of sigmoid. Accordingly, sigmoid_prime is replaced by tanh_prime for the hidden layer during training.
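Concretely, the swapped-in pair might look like the functions below. Here the derivative is written in terms of the tanh output, mirroring how sigmoid_prime is written in terms of the sigmoid output; whether the playground script uses exactly this convention is my assumption.

import math

def tanh(x):
    return math.tanh(x)

def tanh_prime(y):
    # derivative of tanh written in terms of its output y = tanh(x): 1 - y^2
    return 1.0 - y * y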

Example output:

1000 mean squared error: 0.0061729073060459915
2000 mean squared error: 0.0015754154512582549
3000 mean squared error: 0.0008510341785755552
4000 mean squared error: 0.0005731795112170564
5000 mean squared error: 0.00042899519927730044
6000 mean squared error: 0.0003414967710025598
7000 mean squared error: 0.0002830319449863964
8000 mean squared error: 0.00024133098714384138
9000 mean squared error: 0.00021014923444340877
10000 mean squared error: 0.0001859852279128196
for input [0, 0] expected 0 predicted 0.01346 which is correct
for input [0, 1] expected 1 predicted 0.9888 which is correct
for input [1, 0] expected 1 predicted 0.9862 which is correct
for input [1, 1] expected 0 predicted 0.01573 which is correct

See how much faster and better it converges than the sigmoid-only version.

A common choice of activation function is relu. It has plenty of advantages, but for me the most important one is speed: in game tree search, evaluating a bunch of relus is much faster than evaluating a bunch of tanhs.

As an exercise, replace tanh and tanh_prime with relu and relu_prime accordingly and see how it affects training. Note that relu seems more sensitive to weight initialization; in this simple XOR example it diverges more often.

def relu(x):
    return max(0.0, x)


def relu_prime(x):
    # derivative of relu: 0 for x <= 0, 1 otherwise
    if x <= 0:
        return 0.0
    else:
        return 1.0