Training a Tic-Tac-Toe AI with Reinforcement Learning (Part 2)

In the previous tutorial we built a Tic-Tac-Toe with traditional AI using the Minimax algorithm along with other more simplistic techniques. Using the same codebase I am going to demonstrate another method using Reinforcement Learning.

Machine Learning Methods

Today, the phrase Machine Learning is mostly linked with Neural Networks as they have exploded in popularity and use over the last couple of years. So where does Reinforcement Learning (RL) fit in? RL is not as much an algorithm but a type of learning, and is by no means antagonistic to Neural Networks, which are a tool rather than a type. To understand all that a little better, let’s look into the 3 main types of Machine Learning:

  1. Supervised Learning: Given a (usually very large)  set of input-output pairs (X1,Y1), (X2,Y2) … (Xn,Yn), we will produce a function where Y = F(X). Example: Computer Vision. Given the set of images (these are the Xs), pre-labeled (by a human) as containing or not the representation of an apple (the Ys, which in this case have boolean values), derive a function that takes an image as an input and returns a yes or a no as in representing or not an apple.
  2. Unsupervised Learning: Given a (usually very large) set of data points X1, X2 … Xn, each of these containing a number of “features”, we will produce distribution that can give us the probability P(X=Xi) of a new random X point having a particular value Xi. Example: Email Spam Filter. Given a set of emails (these are the Xs), each containing a number of words (these are the features, represented as the number of times each word appears), and also pre-labeled as spam or not, derive the distribution which gives the probability of a new email being spam or not.
  3. Reinforcement Learning: Given an “environment” (ex. a chess board) with a set of discrete states some of which contain rewards and penalties (ex. board configurations, some of which are winning for a given player) and available actions (ex. all moves allowed for a given player), as pairs: (S1, {A1, A2…An} )(S2, {A1, A2…An} ) … (Sn, {A1, A2…An} ), explore the environment by running the actions and evaluating their rewards. Do that until you can derive a policy P(S) = A which given a state S returns the best possible actions A with the goal or reaping the maximum reward. Example: A robot navigating through a room looking for the quickest way to reach it’s destination. (see below)


Basic Ideas of Reinforcement Learning

The basic idea of RL is that we have an agent who starts off by knowing nothing about its environments but is willing to learn. He does so by trying out different things from the set of available actions that exists in every state. Let’s start with the most typical example given, that of a robot, from here on called the agent, navigating a room, represented as simple grid:

Obviously the +1 is a positive reward and the -1 is a negative reward or penalty. We will pretend for the sake of simplicity that all cells have a reward value attached to them, which is 0 for all cells other than the 2 highlighted. We will call this value utility.

The lightly shaded square near the start is an obstacle where the agent cannot step. The agent’s goal is to find a route, here called a policy, that represents the most efficient path to the reward.

Note: The rewards can be as close or as far as one wishes but there is a reason why they are mostly represented as being close: Getting close to one often, in real life situations, brings you close to the other. In a first person shooter for example, to kill an opponent you need to get “physically” close to him, which also brings you potential closer to your own  death as he will be able to do more hard from a closer distance.

So… in the example above, a state is the position the agent occupies in the grid. As you can see it is a 4×3 grid so it has a total of 12 states. How this will be encoded is irrelevant for the moment. Just think of a table or dictionary which maps states to utilities and can be accessed by U[S] = V.

Every time the agent transitions from a square, at state s, into a neighbouring one it produces state s’ (from the set of possible states S). It will look it up in the table and if a value for this state has not been set (the agent is visiting this square for the first time), then it will update the table with the appropriate value (most likely 0). If the state has been visited and a value was set, then it will update the previous state following the formula:

At first, the agent is visiting cells with a value of zero and no update needs to happen. When however it visits the reward cell of +1 it will back-propagate this value into the previous one. Again, at the time when the agent steps into the reward cell, top right, it is at state s’ (the new state).

The more the agent uses the policy, ie. walks the path, it will propagate the values all over the board.

The final step, and after many iterations of that, is where we take the derived table and walk through it in a greedy fashion, transitioning to the cell with the highest score.

Adding the RL Strategy Class

As we have used the Strategy Pattern to abstract away the AI algorithms used by the Player object, it is very easy to add a new technique.