AI exercise: Implementing MENACE
Your task is to implement a simulation of a machine, originally built in the 1960's by Donald Michie, called MENACE . MENACE was an early experiment in reinforcement learning, and it learnt to play a perfect game of Tic-Tac-Toe, by playing against a human player (Michie himself). The most peculiar thing about the original MENACE is that it was built from matchboxes and coloured beads. Computers were not available to that many people in those days ...
You, however, do have access to computers so you can simulate MENACE on that. The original MENACE, with about 300 matchboxes representing possible board positions a player can encounter, was specialized to play only one side of the game. You shall implement a double sided version of this machine, so that it can be trained by playing against itself!
 D. Michie, "Trial and error", in S.A. Barnett and A. McLaren (Eds.), Science Survey, part 2, pp. 129-145, Penguin Books Ltd., 1961. Republished in D. Michie, On Machine Intelligence, 2nd edition, Ellis Horwood, 1986.
The game Tic-Tac-Toe
(3-by-3 noughts and crosses)
The board is a grid of nine squares - three rows, three columns.
Initially all squares are empty.
Two players, X and O, take turn making their moves. X always starts and marks an empty square with an X (surprise!). The other player responds by marking an empty square with an O, and so on.
Once marked, a square can not be changed, i.e. "pieces" don't move.
The objective is to get three of your own marks in a row (horizontally, vertically or diagonally). If this happens, the game ends immediately, with a win for the player with three marks in a row.
If the board is filled (i.e. after 9 moves - 5 by X and 4 by O) and no winning row exists, the game ends in a draw.
(Machine Educable Noughts And Crosses Engine)
The machine consists of one matchbox for each possible state (board position) that the player can encounter during a game. Initially, all boxes are filled with a number of coloured beads where the different colours represent different squares on the board. For example, blue could represent the square in the middle.
For each move during a game, draw a bead at random from the box representing the current state. Put your mark on the square indicated by the colour of that bead. Remember which bead you drew from which box. Go on like this until the game ends.
When the game ends, there are three cases:
- If the machine lost, discard all beads drawn during the game. This is to decrease the probability of making those moves in those states again.
- If the game ends in a draw (no side managed to get three marks in a row), return each bead drawn during the game to the box from which it was drawn. In other words, reset the machine to the configuration it had when the game started.
- If the machine won, return each drawn bead to its box, and add one more bead of the same colour. This increases the probability of making those moves again.
If we fill each box with 9 colours from the start, the machine will sometimes make illegal moves (trying to mark non-empty squares). The machine will work anyway (i.e. learn the rules of the game), if we treat making illegal moves as a loss (for that box only - earlier boxes should not suffer). But it is more efficient to give it the rules from the start, by initializing the boxes with only those colours that correspond to legal moves from that state. The machine should not have to learn the rules of the game, only how to play it well.
When the machine wins or loses we probably want to give late moves in the game more credit/blame than earlier ones. This can be achieved by letting the initial number of beads in a box depend on how early/late in the game the corresponding state occurs. Boxes corresponding to states immediately before the game ends should perhaps have only one bead of each colour (if we lose from such a state, we probably don't want to make that move again, ever). The boxes corresponding to early states, on the other hand, should have many beads of each colour.
The number of possible states (board positions) is bounded by 3^9 = 19683. The actual number is much lower than that, though, due to symmetries, rotations and since games end immediately after a win. If such redundant states are removed, each player has approximately 300 possible states to consider. In the 60's, when MENACE was actually built, this was important, but using a modern computer you probably don't have to optimize the state space for size.
The set of possible states for player X is separate from the set of possible states for player O. In other words, there are no resource conflicts - each box will only be used by one of the players. So, a machine that plays against itself, can just as well be described as two machines playing against each other.
If both players play optimally, all games end in a draw.
The original MENACE, with its 300 boxes, learned to play a perfect game in 150-200 games against an opponent (D. Michie himself) who tried to "follow" the machine in game skill, so that he always played on approximately the same skill level as the machine. If the machine had played against an optimal opponent already from the start, it would not have worked (since an optimal opponent can always force a draw, or win so many times that the box representing the first state becomes empty - in effect, make the machine refuse to play). You, who are to implement a double sided version of MENACE, don't have this problem, since it will play against itself.
Detailed implementation notes
You can freely chose the programming language of the implementation. In most languages you'll want to use arrays, but in Lisp and Prolog you should look into association lists.
You can ignore symmetries. You can actually create the 19683 states, but you can also create states when you reach them for the first time, which yields something like 4000 reachable states. A disadvantage of ignoring symmetries is that the computer will need a longer time to learn: it needs to learn all symmetrical situations separately. So it may need to play a few thousand games rather than a few hundred - still this can be done in minutes.
The interesting part of the assignment is to observe the learning process. Some suggestions are
- print the result of each training game by one letter (w, d, or l for X win, draw or X lose). Thus a training session gives a string of a few thousand letters, and you should see a tendency towards more and more draws.
- make it possible to stop the training after a number of games and observe the state - either by looking up a position (typically the initial position is interesting) or by playing a game.
- playing a game should be handled differently from training. When the machine plays a game for real, it should play greedily. This means that it should then always select the move which is currently believed to be the best (has the greated number of remaining beads), instead of drawing beads at random from the box.
What do you do when a box becomes empty? This means that the machine has lost too many games from this position, whatever move it makes. One easy solution is to reset the box (maybe all moves are equally bad) and hope that the machine learns to avoid this position. Another solution is not to remove the last bead (it lost, but less often than the other moves).
You want to start out with enough beads, particularly in the first boxes. Olle reports that the first random training games lead to a win for X in 60% of the games, a draw in 10% and a win for O in 30%. So in the beginning, O will lose many beads. Perhaps a solution is to reward a win by O by tripling (instead of doubling) the winning beads? I don't know, but I expect you to try out things like this, and some ideas of your own!
What to hand in
- the code
- a report, telling what you tried, and what the outcome was. This includes some test runs with appropriate explanations.
When and how
- Deadline: Monday May 22
- Hand in to: depending on programming language, as assignment 2.