Training a Tic-Tac-Toe AI with Reinforcement Learning (Part 2)

(Source Code in Swift 3.0 at BitBucket)

In the previous tutorial we built a Tic-Tac-Toe game with a traditional AI using the Minimax algorithm, along with other, simpler techniques. Using the same codebase, I am going to demonstrate another method: Reinforcement Learning.

Machine Learning Methods

Today, the phrase Machine Learning is mostly associated with Neural Networks, as they have exploded in popularity and use over the last couple of years. So where does Reinforcement Learning (RL) fit in? RL is not so much an algorithm as a type of learning, and it is by no means antagonistic to Neural Networks, which are a tool rather than a type. To understand all that a little better, let’s look at the 3 main types of Machine Learning:

  1. Supervised Learning: Given a (usually very large) set of input-output pairs (X1,Y1), (X2,Y2) … (Xn,Yn), produce a function F where Y = F(X).
    Example: Computer Vision. Given a set of images (these are the Xs), pre-labeled (by a human) as containing or not containing an apple (the Ys, which in this case are boolean values), derive a function that takes an image as input and returns yes or no depending on whether it represents an apple.
  2. Unsupervised Learning: Given a (usually very large) set of data points X1, X2 … Xn, each containing a number of “features”, produce a distribution that gives the probability P(X=Xi) of a new random point X taking a particular value Xi.
    Example: Grouping emails. Given a set of emails (these are the Xs), each represented by a number of features (the number of times each word appears), and with no labels attached, derive a distribution or clustering that reveals structure in the data, such as which emails cluster together as spam-like. (Note that the classic spam filter, trained on emails pre-labeled as spam or not, is actually a supervised problem.)
  3. Reinforcement Learning: Given an “environment” with discrete states S, each containing a set of possible actions A given as pairs: (s1, {a1, a2…an} ), (s2, {a1, a2…an} ) … (sn, {a1, a2…an} ), explore the environment by running the actions in every state and back-propagating their rewards or penalties. Do that until you can derive a policy P(s) = a which, given a state si, returns the best possible action ai, with the goal of reaping the maximum total reward.
    Example: A robot navigating through a room, looking for the quickest way to reach its destination. (see below)


Basic Ideas of Reinforcement Learning

The basic idea of RL is that we have an agent who starts off knowing nothing about its environment but is willing to learn. It does so by trying out different things from the set of actions available in every state. Let’s start with the most typical example given, that of a robot, from here on called the agent, navigating a room represented as a simple grid:

The cells with numbers written in them are end states. Obviously the +1 is a positive reward and the -1 a negative reward, or penalty. For the sake of simplicity, we will pretend that all cells have a reward value attached to them, which is 0 for all cells other than the two highlighted. We will call this value utility.

The lightly shaded square at (1,1) is an obstacle where the agent cannot step. The agent’s goal is to find a route, here called a policy, that represents the most efficient path to the reward. Although the agent knows almost nothing about its environment, it will have a default policy that serves as a starting point for the learning process. Without one, it could end up going in circles around the obstacle and never reach an end state.

Note: The rewards can be as close to or as far from each other as one wishes, but there is a reason why they are mostly depicted close together: in real-life situations, getting close to one often brings you close to the other. In a first-person shooter, for example, to kill an opponent you need to get “physically” close to him, which also brings you potentially closer to your own death, as he will be able to do more harm from a closer distance.

So… in the example above, a state is the position the agent occupies in the grid. As you can see it is a 4×3 grid, so it has a total of 12 states. How this will be encoded is irrelevant for the moment. Just think of a table or dictionary which maps states to utilities and can be accessed as U[S] = V.
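Such a state-utility table might be sketched like this in Swift (the type and key encoding are illustrative, not the tutorial's actual code):

```swift
// A minimal sketch of a state→utility table for the grid world.
// States are encoded as "row,col" string keys; unseen states default to 0.
struct UtilityTable {
    private var values: [String: Double] = [:]

    subscript(state: String) -> Double {
        get { return values[state] ?? 0.0 }  // unvisited cells read as 0
        set { values[state] = newValue }
    }
}

var U = UtilityTable()
U["2,3"] = 1.0    // the +1 goal cell
U["1,3"] = -1.0   // the -1 penalty cell
```

Every other cell simply reads back as 0 until the learning process writes a value into it.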

Every time the agent transitions from a square at state s into a neighbouring one, it produces state s’ (from the set of possible states S). It looks s’ up in the table; if a value for this state has not been set (the agent is visiting this square for the first time), it updates the table with an appropriate default (most likely 0). If the state has been visited and a value was set, it updates the previous state following a temporal-difference formula of the form:

U(s) ← U(s) + α · ( R(s’) + γ · U(s’) − U(s) )

where α is the learning rate and γ the discount factor.
At first, the agent visits cells with a value of zero, so no update needs to happen. When, however, it visits the reward cell of +1, it back-propagates this value into the previous one. (At the moment the agent steps into the reward cell, top right, that cell is state s’, the new state.)
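The back-propagation step can be sketched as a temporal-difference update. This is an assumption about the exact formula (which appeared as an image in the original post), with hypothetical names and default values for alpha and gamma:

```swift
// TD(0)-style back-propagation: pull U(s) towards the reward plus the
// discounted utility of the successor state s'.
// alpha = learning rate, gamma = discount factor (values are assumptions).
func update(utilities: inout [String: Double],
            previous s: String, next sPrime: String,
            reward: Double, alpha: Double = 0.1, gamma: Double = 0.9) {
    let uS = utilities[s] ?? 0.0
    let uSPrime = utilities[sPrime] ?? 0.0
    utilities[s] = uS + alpha * (reward + gamma * uSPrime - uS)
}

var utilities: [String: Double] = [:]
// Stepping into the +1 goal: part of the reward flows into the previous cell.
update(utilities: &utilities, previous: "0,2", next: "0,3", reward: 1.0)
```

After this single step the cell just before the goal holds 0.1; repeated runs keep dragging the chain of cells behind it towards the reward.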

Now, the second time the agent runs, it will have one more cell with a non-zero value: cell (0,3), the one just before the end, as seen above. When it steps into it, it will back-propagate the value to cell (0,2) behind it.

In its next move the agent will update the value of cell (0,3) once again, using the formula above.

After a few runs, the path will have been reinforced, as the +1 of the goal state will have propagated backwards along the whole path, as seen below.


The only problem is that this way the agent has no means of testing whether its policy is correct. In fact, what we have done here is anti-learning! The agent just reinforces what it has already assumed to be correct, namely the original policy, which was chosen with no information about the environment. What we need, therefore, is the ability to randomise actions here and there, so that if they lead to a better solution they have a chance of being selected in the future. We will use a variable called epsilon that determines how often the agent randomly picks another path rather than following the original.
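This is the classic epsilon-greedy trade-off between exploration and exploitation. A minimal sketch (the function name and the use of modern Swift random APIs are assumptions, not the tutorial's code):

```swift
// Epsilon-greedy action selection: with probability epsilon explore a random
// action, otherwise exploit the action leading to the highest known utility.
// Precondition: `actions` is non-empty.
func chooseAction(from actions: [String],
                  utilities: [String: Double],
                  epsilon: Double) -> String {
    if Double.random(in: 0..<1) < epsilon {
        return actions.randomElement()!          // explore
    }
    return actions.max {                          // exploit
        (utilities[$0] ?? 0.0) < (utilities[$1] ?? 0.0)
    }!
}
```

With epsilon = 0 this degenerates into the pure greedy behaviour described above; a small epsilon (say 0.1) gives unexplored paths a chance to prove themselves.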

After a large number of runs, the whole board will have settled on values that are more or less stable; in other words, further runs will not significantly change the numbers below.

The final step, after many such iterations, is to take the derived table and walk through it in a greedy fashion, always transitioning to the neighbouring cell with the highest score.

RL for Tic Tac Toe States

Of course we are not building a robot in this tutorial, at least not yet! It is, however, fairly easy to translate the concept described above into board configurations for our game.

We will begin by knowing nothing, which means that all configurations will have a utility of zero, apart from the winning states, which will have a +1 or a -1 depending on whether the player under training won or lost.

Adding the RL Strategy Class

As we have used the Strategy Pattern to abstract away the AI algorithms used by the Player object, it is very easy to add a new technique. First we will create an internal class to hold the training data: a kind of dictionary that maps board states to values.

So we have a very simple hash function that converts a board into a string to be used as a key for the dictionary. Access to this table is through a subscript, and the only thing to note is that when a required entry does not yet exist, the table creates it on the fly, setting its value according to an evaluation (score) function, and then returns it.
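A sketch of that table, assuming a board represented as an array of cell marks; the names (`boardKey`, `StateTable`, `score`) are illustrative, not the repository's actual API:

```swift
// Hash a board into a string key for the dictionary.
func boardKey(_ cells: [String]) -> String {
    return cells.joined(separator: "|")
}

// Maps board states to utility values; missing entries are created on the
// fly, seeded by an evaluation (score) function.
final class StateTable {
    private var values: [String: Double] = [:]
    private let score: ([String]) -> Double

    init(score: @escaping ([String]) -> Double) {
        self.score = score
    }

    subscript(board: [String]) -> Double {
        get {
            let key = boardKey(board)
            if let existing = values[key] { return existing }
            let initial = score(board)    // create the entry on first access
            values[key] = initial
            return initial
        }
        set { values[boardKey(board)] = newValue }
    }
}

// New states start at whatever the score function assigns (here, 0).
let table = StateTable(score: { _ in 0.0 })
```

The on-the-fly creation means we never have to enumerate all ~5,000 reachable Tic-Tac-Toe positions up front; states appear in the table only as they are actually visited.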

Learning through Self-Play

Now we have built a strategy that can learn by backtracking from an end state (either a win or a loss for the player using it), but how do we go about using it? One way is to have a human player play against it. This would have the advantage of training against an actual intelligence, but it would require significant effort on behalf of the player, who would need to play at least a few thousand games. What we should do instead is have agents play against each other. One of the two will use the RL technique and the other can (for the moment) play at random.

I have built a controller that performs such self play and measures the results on the screen.

As a base case, we will have two random players play against each other, as seen above. What we expect is that the victories will be distributed roughly equally between them.
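The base-case experiment can be sketched like this (a stand-in harness, not the tutorial's actual controller): two players pick random legal moves until someone wins or the board fills, and we tally the outcomes over many games.

```swift
// Minimal random-vs-random self-play harness for a 3×3 board.
struct TicTacToe {
    var cells = Array(repeating: "-", count: 9)
    static let lines = [[0, 1, 2], [3, 4, 5], [6, 7, 8],
                        [0, 3, 6], [1, 4, 7], [2, 5, 8],
                        [0, 4, 8], [2, 4, 6]]
    // Returns "X" or "O" if a line is completed, nil otherwise.
    func winner() -> String? {
        for line in TicTacToe.lines {
            let m = line.map { cells[$0] }
            if m[0] != "-" && m[0] == m[1] && m[1] == m[2] { return m[0] }
        }
        return nil
    }
}

func playRandomGame() -> String {
    var game = TicTacToe()
    var mark = "X"
    while game.winner() == nil {
        let free = (0..<9).filter { game.cells[$0] == "-" }
        guard let move = free.randomElement() else { return "draw" }
        game.cells[move] = mark
        mark = (mark == "X") ? "O" : "X"
    }
    return game.winner()!
}

var tally = ["X": 0, "O": 0, "draw": 0]
for _ in 0..<1000 { tally[playRandomGame(), default: 0] += 1 }
```

Note that even between two random players the split is not exactly 50/50: ‘X’ moves first and so wins somewhat more often, which is worth keeping in mind when reading the victory counts.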

And that is what we get… Now let’s try changing the ‘O’ player to an RL strategy.

Bingo! We are learning!