(Code on BitBucket)

Whatever is hard to achieve is interesting to try. And neural networks are definitely difficult to both understand and implement, at least at first, at least for me!

Here, I have documented my effort to create a hand-writing recogniser from scratch, following tutorials and code examples from “Data Science from Scratch” by Joel Grus.

# Theory

If you are entirely new to the subject, I strongly recommend that you read through this tutorial, which lays the foundations by introducing the reader to the smallest network possible: one with a single neuron, also known as a perceptron.

#### A Better Intuition

What the tutorial above shows is how to calibrate the weights of a single neuron by giving it training input and measuring it against a known output. It is well written, and it references yet another, older tutorial which is just as excellent (and, I believe, the original one). Reading through it, however, there were a few points that, although clearly stated, did not sink in at all, as I was missing an intuition for the basics of this complicated topic. So, with the full understanding that we all differ in the way we visualise and comprehend, here is my effort:

Although the earliest pioneers of the field were biologists, the best way for me to understand neural networks has been to forget the biological analogies, since performing what are essentially matrix multiplications and calling that a simulation of a brain is, to me, a marketing ploy (although a benign one)!

Think rather of each neuron as an equation, like the one below.

[pmath size=14] f(x,y,z) = a*x + b*y + c*z [/pmath]

This is a first degree equation with 3 variables. Each variable is an **input** to the neuron (and for the moment, you can think of these inputs as binary).

Next, think that we solve the equation for some **x**, **y** and **z** and evaluate the resulting value. If it is larger than a certain *threshold* value, we tell the neuron to “fire”; if not, it “doesn’t fire”. In other words, we feed the result to a function that looks like the one below:

What we see here is that if the final number we got from the equation is negative, the result will be **0**, while if it is positive, regardless of by how much, it will be **1**. This is called a “step” function (because it looks like a step, I guess… 😉).

So here we have it, 3 variables going in, multiplied respectively by 3 different weights and finally added together to see if they pass a threshold value. This single neuron is also called a *perceptron*, and it comprises the smallest computation unit in a neural network.
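The description above can be sketched in a few lines of Swift. This is my own illustration rather than the project’s code; the weights and threshold below are picked so the perceptron behaves like an AND gate:

```swift
// A step "activation": fire (1) when the value is at or above zero.
func step(_ t: Double) -> Double {
    return t >= 0 ? 1.0 : 0.0
}

// A perceptron: inputs multiplied by weights, summed, and compared
// against a threshold value.
func perceptron(inputs: [Double], weights: [Double], threshold: Double) -> Double {
    precondition(inputs.count == weights.count)
    var sum = 0.0
    for (x, w) in zip(inputs, weights) {
        sum += x * w
    }
    return step(sum - threshold)
}

// With weights [1, 1] and threshold 1.5, only (1, 1) makes it fire.
perceptron(inputs: [1.0, 1.0], weights: [1.0, 1.0], threshold: 1.5)  // 1.0
perceptron(inputs: [1.0, 0.0], weights: [1.0, 1.0], threshold: 1.5)  // 0.0
```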

Instead of writing down the values on the left of the neuron, we can (and usually will) represent them as neurons themselves. These neurons form the *input* layer of the network, while the perceptron forms the *output* layer. You can therefore think of the diagram above as a 2-layered neural network.

Later we will add a 3rd layer between input and output, called the *hidden* layer. With these 3 layers alone, we will try to build our recogniser.

#### Activation Function

The step function we saw above is a kind of activation function. “Activation” means that there is a “cut-off” point after which the function returns a 1, or “fires” in neural-network speak. A step function has too many limitations: it is discontinuous (it is not defined at 0) and it does not output a range of values. We will see why this is important later on, but here we will present a function that is widely used in neural networks, the *sigmoid function*.

The function derives its name from the “S” (or sigma in Greek) which it resembles. The equation that produces it might look daunting at first, but we only need to copy it for the sake of its shape rather than understand its mystical implications.

[pmath size=18] f(t) = 1/{1 + e^(- t)} [/pmath]

Plug in various values for t and you will get a value between **0.0** and **1.0**, smoothly approaching **0** as t approaches **minus infinity** and **1** as t approaches **plus infinity**. For very large values, either positive or negative, the sigmoid resembles the step function, as it outputs **≈0.0** or **≈1.0**.

For this project we will create a sigmoid function together with its first derivative and store it in the Utils.swift file:

```swift
func sigmoid(t: Double, deriv: Bool = false) -> Double {
    // Note: when deriv is true, t is expected to already be a sigmoid
    // *output* o, for which the derivative is o * (1 - o).
    if deriv == true {
        return t * (1 - t)
    }
    return 1.0 / (1.0 + exp(-t))
}
```
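To see the “smoothed step” behaviour for yourself, plug a few values into the function (repeated here so the snippet stands on its own):

```swift
import Foundation

func sigmoid(t: Double, deriv: Bool = false) -> Double {
    if deriv { return t * (1 - t) }  // t is assumed to be a sigmoid output here
    return 1.0 / (1.0 + exp(-t))
}

print(sigmoid(t: 0.0))    // 0.5, exactly half-way between "fire" and "don't fire"
print(sigmoid(t: 10.0))   // ≈0.99995, practically the step function's 1
print(sigmoid(t: -10.0))  // ≈0.0000454, practically the step function's 0
```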

# Practice

Let’s do it!

#### What are we building here?

We will train a network to recognise hand-written digits. OK! There is more than one catch to this… First, the resolution is very low, as we will be drawing on a 5×5 canvas. Second, the digits need to take advantage of the whole canvas. Third, it will not work every time. Now… in my defence, the book by Grus that I quoted above has the same flaws in its Python code.

#### Training Set

If you read the tutorial above you will know that we need to feed the network with examples of data and their classification. In our case the data is already-drawn digits represented in ASCII art, like so:

```
@@@@@
....@
@@@@@
....@
@@@@@
3
```

The first 5 lines are the digit and the line below is the correct result, in our case a 3. What we are hoping for is that after the network has been trained against that, it will recognise a similar digit it has never seen and classify it as a 3.

```
@@@@@
....@
.@@@@
....@
.@@@.
3
```

and we will be feeding this too. What we are hoping for is that the network, looking at these examples will infer what differentiates a **3** from a **5**.

To feed this data we need to turn it into some kind of data structure, and for this we will use an array where marks (@) translate to 1s and gaps (.) to 0s. So the first example above will be:

```swift
let digit3 = [1.0, 1.0, 1.0, 1.0, 1.0,
              0.0, 0.0, 0.0, 0.0, 1.0,
              0.0, 1.0, 1.0, 1.0, 1.0,
              0.0, 0.0, 0.0, 0.0, 1.0,
              1.0, 1.0, 1.0, 1.0, 1.0]
```

Hopefully if you squint your eyes you will recognise the digit in the numbers! 😉

#### Loading the Data

We open the file and break it into lines. The easiest way of doing it would be to load it all as a string and then split it into an array on the newline character. Although this would be fine in our case, we still want to generalise a little and account for a large training set. We also want to define the encoding and the newline character in case they change across platforms. So we will read the file in batches*:

```swift
import Foundation

enum ReaderError: Error {
    case CannotReadFile
}

class Reader: Sequence, IteratorProtocol {
    typealias Element = String

    var buffer: Data
    var handler: FileHandle!
    var delim: Data
    var end: Bool = false
    let encoding: String.Encoding
    let chunkSize: Int

    init?(path: String,
          delimiter: String = "\n",
          encoding: String.Encoding = String.Encoding.utf8,
          chunkSize: Int = 4096) {
        self.chunkSize = chunkSize
        self.encoding = encoding
        guard let fhandler = FileHandle(forReadingAtPath: path),
              let dlimiter = delimiter.data(using: encoding) else {
            return nil
        }
        self.handler = fhandler
        self.delim = dlimiter
        self.buffer = Data(capacity: chunkSize)
    }

    func next() -> String? {
        precondition(self.handler != nil, "Attempt to read from closed file")
        while !self.end {
            if let range = buffer.range(of: delim) {
                let line = String(data: self.buffer.subdata(in: 0..<range.lowerBound),
                                  encoding: encoding)
                self.buffer.removeSubrange(0..<range.upperBound)
                return line
            }
            let read = handler.readData(ofLength: self.chunkSize)
            if read.count > 0 {
                self.buffer.append(read)
            } else {
                self.end = true
                if buffer.count > 0 {
                    let last = String(data: buffer as Data, encoding: encoding)
                    self.buffer.count = 0
                    return last
                }
            }
        }
        return nil
    }

    func close() -> Void {
        self.handler?.closeFile()
        self.handler = nil
    }

    deinit {
        self.close()
    }
}
```

\* *I want to thank Martin R for this post.*

Here we are using a Sequence so we can loop through the lines in a `for in` loop. Be careful about one thing though: our Sequence is a class, which means it is passed by reference, which in turn means that if we try to loop over it twice it will not give us any data the second time, because it will already be at the end of the file. We will not be looping more than once, so it’s a small compromise.
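The one-pass behaviour is easy to demonstrate with a toy Sequence class (not part of the project, just an illustration of the same pattern Reader uses):

```swift
// A class-based Sequence that, like Reader, consumes its state as it iterates.
class Countdown: Sequence, IteratorProtocol {
    var current: Int
    init(from: Int) { self.current = from }
    func next() -> Int? {
        guard current > 0 else { return nil }
        defer { current -= 1 }
        return current
    }
}

let countdown = Countdown(from: 3)
let firstPass = Array(countdown)   // [3, 2, 1]
let secondPass = Array(countdown)  // [] — the cursor lives in the class instance
```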

All the reader does is open a file and feed it to us, line by line. To parse the contents of these lines we will build another class that takes a Reader sequence and looks in its output for known characters.

```swift
import Foundation

enum ParserError: Error {
    case UnknownCharacter(character: Character)
}

class Parser {
    var reader: Reader
    var width: Int
    var height: Int

    init(with reader: Reader, width: Int, height: Int) {
        self.reader = reader
        self.width = width
        self.height = height
    }

    func parse(off: Character = "0", on: Character = "1") throws -> TrainingSet {
        var trainingSet = TrainingSet(size: width * height)
        var characterVector = Vector()
        var ln = 0
        for line in self.reader {
            if ln == height {
                trainingSet.add(input: characterVector, target: line[1])
                characterVector = Vector()
                ln = 0
            } else {
                for character in line.characters {
                    var value: Double
                    if character == on {
                        value = 1.0
                    } else if character == off {
                        value = 0.0
                    } else {
                        throw ParserError.UnknownCharacter(character: character)
                    }
                    characterVector.append(value)
                }
                ln += 1
            }
        }
        return trainingSet
    }
}
```

We initialise this class by defining for it the width and height of our ASCII-art canvas. In our case it is 5×5, but hopefully we could generalise to larger training sets.

We save the parsed data as an input vector (the digit) and an output vector representing the number it maps to. We saw how we vectorised the ASCII digit, but what about the resulting number? We will use a vector of size 10 which is all 0s except for one element set to 1, at the index of the number from 0 to 9. So 3 becomes:

```swift
let three = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Remember the array is zero-indexed.
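Building such a vector can be wrapped in a tiny helper. This is my own sketch, not a function from the project:

```swift
// One-hot encode a digit 0...9 as a 10-element vector:
// all zeros except a 1.0 at the digit's index.
func oneHot(digit: Int, size: Int = 10) -> [Double] {
    precondition(digit >= 0 && digit < size)
    var v = Array(repeating: 0.0, count: size)
    v[digit] = 1.0
    return v
}

let three = oneHot(digit: 3)
// [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```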

Vectors and matrices are used throughout machine learning to such a degree that Google is investing money in building a tensor processing unit, the TPU. We will define a couple of things that will help us along the way.

```swift
typealias Vector = [Double]

infix operator **

func **(left: Vector, right: Vector) -> Double {
    precondition(left.count == right.count)
    var sum = 0.0
    for (l, r) in zip(left, right) {
        sum += l * r
    }
    return sum
}
```

The inner product of two vectors is the pairs of elements multiplied together and then added up. Remember how our perceptron took n inputs, multiplied them by its weights and added them together? That was the inner product of the inputs with the neuron’s weights, each represented as a vector.
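With the operator in place, the dot product reads quite naturally (the definition is repeated here so the snippet stands alone; the weight values are arbitrary):

```swift
typealias Vector = [Double]

infix operator **

// Inner (dot) product: multiply element-wise, then sum.
func **(left: Vector, right: Vector) -> Double {
    precondition(left.count == right.count)
    var sum = 0.0
    for (l, r) in zip(left, right) { sum += l * r }
    return sum
}

let weights: Vector = [0.5, -1.0, 2.0]
let inputs: Vector  = [1.0, 1.0, 1.0]
let activation = weights ** inputs  // 0.5 - 1.0 + 2.0 = 1.5
```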

#### Building the UI

We are not going to go into this in great detail as it is not the focus of this tutorial. Suffice it to say that we are splitting a UIView into a 5×5 grid, by adding subviews to act as “pixels” of sorts.

```swift
static let GRID = 5

override func awakeFromNib() {
    super.awakeFromNib()
    self.createGrid()
}

private func createGrid() {
    for row in 0..<DigitView.GRID {
        for col in 0..<DigitView.GRID {
            let square = UIView()
            square.tag = row * DigitView.GRID + col
            self.addSubview(square)
        }
    }
}

override func layoutSubviews() {
    super.layoutSubviews()
    let w = self.bounds.size.width / CGFloat(DigitView.GRID)
    let h = self.bounds.size.height / CGFloat(DigitView.GRID)
    var fr = CGRect(x: 0, y: 0, width: w, height: h)
    for row in 0..<DigitView.GRID {
        for col in 0..<DigitView.GRID {
            let square = self.subviews[row * DigitView.GRID + col]
            square.frame = fr
            fr.origin.x += fr.size.width
        }
        fr.origin.x = 0
        fr.origin.y += fr.size.height
    }
}
```

When the user drags their finger over one of them, it will light up. When the user lifts their finger, the pixels that have “lit up” will be 1s and the rest will be 0s in our output array.

```swift
func compute() {
    var serialized = Array(repeating: 0.0, count: DigitView.GRID * DigitView.GRID)
    for s in self.digitView.subviews {
        if s.backgroundColor == UIColor.on {
            serialized[s.tag] = 1.0
        }
    }
}
```

#### Modelling our Network

Let’s start building the actual neural network. We said that the input will be a 25-sized vector representing every pixel of the 5×5 canvas as being on or off, 1 or 0. Our output will be a 10-sized vector representing the probability of the input being one of our 10 choices, the digits 0 to 9. Ideally we would like to see 0.0s in all but one element, which would be 1.0. In practice this will be far from the truth, as we will get a few close matches and all we can do is pick the best. The hidden layer will have 5 neurons.

###### The Neuron

We obviously need to define a single neuron as a class we can instantiate.

```swift
import Foundation

class Neuron {
    var weights: [Double] = []
    var bias: Double

    init(weights: [Double], bias: Double = 1.0) {
        self.weights = weights
        self.bias = bias
    }

    func output(input: [Double]) -> Double {
        precondition(input.count == weights.count,
                     "The input vector is a different size from the weights")
        let sum = input ** weights
        return sigmoid(t: sum)
    }
}
```

Here we encapsulate the sigmoid function we talked about above. A neuron has a list of weights, which is what gets adjusted during training.
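As a quick sanity check, a neuron with known weights is just the weighted sum fed through the sigmoid. The helpers are repeated so the sketch is self-contained, and the weight values are hand-picked by me so the neuron acts like an AND gate when the last input plays the role of the bias:

```swift
import Foundation

typealias Vector = [Double]

infix operator **
func **(left: Vector, right: Vector) -> Double {
    precondition(left.count == right.count)
    return zip(left, right).reduce(0.0) { $0 + $1.0 * $1.1 }
}

func sigmoid(t: Double) -> Double {
    return 1.0 / (1.0 + exp(-t))
}

class Neuron {
    var weights: Vector
    var bias: Double
    init(weights: Vector, bias: Double = 1.0) {
        self.weights = weights
        self.bias = bias
    }
    func output(input: Vector) -> Double {
        precondition(input.count == weights.count)
        return sigmoid(t: input ** weights)
    }
}

// Third input is a constant 1.0, so -30.0 acts as the bias weight.
let neuron = Neuron(weights: [20.0, 20.0, -30.0])
print(neuron.output(input: [1.0, 1.0, 1.0]))  // ≈1.0 (both inputs on)
print(neuron.output(input: [0.0, 1.0, 1.0]))  // ≈0.0 (only one input on)
```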

###### The Network

A network is composed of a few different layers, each being a collection of neurons. There is typically an input, an output and a hidden layer, each with a different number of neurons. To understand this better, think of the input layer as nothing but the input vector. We know that our data has a 5×5 resolution in terms of the digits being drawn, and that we vectorised that into a 25-element array. So the input layer is nothing but a set of 25 entry points, represented as neurons for simplicity. These are not neurons in the sense that they will be trained or that their weights affect anything; they are just there to feed the rest of the network with the entry data.

The output layer on the other hand corresponds to the output vector which, as discussed, has 10 elements, each representing, **with its position**, a number. So what we really want to do is:

```swift
let almost3 = [0.0, 1.0, 1.0, 1.0, 1.0,
               0.0, 0.0, 0.0, 0.0, 1.0,
               0.0, 1.0, 1.0, 1.0, 1.0,
               0.0, 0.0, 0.0, 0.0, 1.0,
               0.0, 1.0, 1.0, 1.0, 1.0]

let results = net.predict(input: almost3)
// results = [0.0, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0]

var bestGuess = 0.0
var digitIndex = 0
for (i, p) in results.enumerated() {
    if p > bestGuess {
        bestGuess = p
        digitIndex = i
    }
}
print("The number is: \(digitIndex)")
```

Here, we drew a 3 missing a few pixels. Our trained network should be able to understand that this is still a 3 without having seen it before. This is the whole point of machine learning, of which neural networks are a subset.

We will define a function that will push data (in the form of the input vector) down the network.

```swift
func forward(input: Vector) -> (Vector, Vector) {
    var inputs: Vector = input
    var total: [Vector] = [Vector]()
    for layer in layers {
        var outputs = [Double]()
        for neuron in layer {
            let biased = inputs + [neuron.bias]  // biased.count == 26
            let o = neuron.output(input: biased)
            outputs.append(o)
        }
        total.append(outputs)
        inputs = outputs
    }
    return (total.first!, total.last!)
}
```

###### Training

What the function returns is the output of the output layer and that of the hidden layer. Have a look at the NetworkTest.swift file and check out how, with only this function and manually chosen weights, we can approximate a *XOR* gate. This might give you some more intuition.
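The XOR idea is worth sketching in full, since it shows the forward pass with hand-picked weights. The values below are the classic ones from Grus’s book; the last weight in each list multiplies a constant 1.0 and so plays the role of the bias:

```swift
import Foundation

func sigmoid(t: Double) -> Double { return 1.0 / (1.0 + exp(-t)) }

// Each neuron is a weight list; the last element is the bias weight.
let hiddenLayer = [[20.0, 20.0, -30.0],   // acts like AND
                   [20.0, 20.0, -10.0]]   // acts like OR
let outputLayer = [[-60.0, 60.0, -30.0]]  // "OR but not AND", i.e. XOR

func feedForward(_ input: [Double]) -> Double {
    let hidden = hiddenLayer.map { weights -> Double in
        let biased = input + [1.0]
        return sigmoid(t: zip(biased, weights).reduce(0.0) { $0 + $1.0 * $1.1 })
    }
    let biased = hidden + [1.0]
    return sigmoid(t: zip(biased, outputLayer[0]).reduce(0.0) { $0 + $1.0 * $1.1 })
}

print(feedForward([0.0, 0.0]))  // ≈0
print(feedForward([1.0, 0.0]))  // ≈1
print(feedForward([0.0, 1.0]))  // ≈1
print(feedForward([1.0, 1.0]))  // ≈0
```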

In reality, of course, we do not want to be adjusting weights manually but through a function. The method is called **back-propagation**, and it takes the error between the current output and the desired one and propagates it backwards through the network, adjusting the weights accordingly.

```swift
func backward(inputs: Vector, targets: Vector) {
    let (hidden_outputs, final_outputs) = self.forward(input: inputs)

    var output_deltas = [Double]()
    for (output, target) in zip(final_outputs, targets) {
        let d = sigmoid(t: output, deriv: true) * (output - target)
        output_deltas.append(d)
    }

    guard let output_layer: Layer = self.layers.last else { return }
    for (i, neuron) in output_layer.enumerated() {
        for (j, houtput) in (hidden_outputs + [1.0]).enumerated() {
            neuron.weights[j] -= output_deltas[i] * houtput
        }
    }

    var hidden_deltas = [Double]()
    for (i, hout) in hidden_outputs.enumerated() {
        var weights = [Double]()
        for neuron in output_layer {
            let w = neuron.weights[i]
            weights.append(w)
        }
        let dot = output_deltas ** weights
        let r = sigmoid(t: hout, deriv: true) * dot
        hidden_deltas.append(r)
    }

    guard let hidden_layer: Layer = self.layers.first else { return }
    for (i, hidden_neuron) in hidden_layer.enumerated() {
        for (j, input) in (inputs + [1.0]).enumerated() {
            hidden_neuron.weights[j] -= hidden_deltas[i] * input
        }
    }
}
```

Now, what we need is to repeat the training algorithm a large number of times until the values settle.

```swift
func train(set: TrainingSet) {
    for _ in 0..<10000 {
        for (input, target) in set {
            self.backward(inputs: input, targets: target)
        }
    }
}
```

A TrainingSet is just a Sequence class that holds the training data as input/output pairs of Vectors.
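For reference, a minimal TrainingSet could look like the sketch below. This is my own approximation, not the project’s exact type: I assume the label arrives as a digit character (as the Parser passes it) and is one-hot encoded into the 10-element target vector described earlier.

```swift
typealias Vector = [Double]

// Holds (input, target) pairs and yields them in a for-in loop.
struct TrainingSet: Sequence {
    let size: Int
    private var pairs: [(Vector, Vector)] = []

    init(size: Int) {
        self.size = size
    }

    mutating func add(input: Vector, target: Character) {
        precondition(input.count == size, "Input vector has the wrong size")
        // One-hot encode the label character, e.g. "3" -> index 3.
        var out = Array(repeating: 0.0, count: 10)
        if let digit = Int(String(target)) {
            out[digit] = 1.0
        }
        pairs.append((input, out))
    }

    func makeIterator() -> IndexingIterator<[(Vector, Vector)]> {
        return pairs.makeIterator()
    }
}
```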

#### Results

The network does a good job of predicting some digits and a horrible one at predicting others.

The digits 3, 9 and 7 were accurately predicted, although they were quite heavily deformed. On the other hand, there is no way the network will predict the following:

The reason is that the “**1**” in this example is not placed exactly at the center. The pixels that the network was reinforcing to deal with this digit are all turned off. To succeed in these cases, which are by the way much closer to real-world problems, the network needs to get more complex, with a layer dedicated to the perception of placement and another to the perception of shape.