
Following up from answers to:

Understanding AlphaZero

My question would be how the neural net "learns" what to do in a position it hasn't encountered. Saying the actual AZ executes an MCTS using the bias + weights from the trained neural net just pushes it back a step to how the neural net calculates these values. If it was through random self-play, with no human knowledge, then how does it decide how to weight a position it has never seen?

Roy Koczela

2 Answers


The evaluation function of a chess engine, whether instantiated as a neural net or explicit code, is always able to assign a value to any board position. If you give it a board position, even an absurd one that would never occur in a game, it will be able to spit out a number representing how favorable it is to one player or the other. Since the number of board positions in chess is unmanageably gigantic, training can only occur on an infinitesimal sample of the game tree. The engine is not simply recalling previously calculated values of board positions, but is performing calculations based on the arrangement of the pieces. For a non-neural-net example, part of a chess engine's evaluation might be to add up the value of each piece on its side and subtract the total value of the opponent's pieces. Then, one set of parameters that would be adjusted during training would be the value of each piece.
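
For concreteness, here is a toy version of that material-count idea in Python. The board encoding and the starting values are made up for illustration; only the piece values would be the trainable parameters.

    # Toy material-count evaluation. The piece values below are
    # hypothetical starting parameters that training would tune;
    # nothing here comes from a real engine.
    piece_values = {"P": 1.0, "N": 3.0, "B": 3.0, "R": 5.0, "Q": 9.0}

    def evaluate(board):
        """Score a position from White's point of view.

        `board` is a list of piece codes such as "wP" (white pawn)
        or "bQ" (black queen). Positive scores favor White.
        """
        score = 0.0
        for piece in board:
            color, kind = piece[0], piece[1]
            value = piece_values.get(kind, 0.0)
            score += value if color == "w" else -value
        return score

    # Any arrangement of pieces gets a number, seen before or not.
    print(evaluate(["wQ", "wP", "wP", "bR", "bN"]))  # 9 + 1 + 1 - 5 - 3 = 3.0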

When the engine is untrained, the values assigned to a position might as well be random, since the parameters of the evaluation function start out with (usually) random values. The goal of the training phase is to adjust those parameters so that the engine assigns high scores to board positions that are likely winning states for the player.
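
Continuing the toy evaluator above, a crude gradient-style update might nudge the piece values after each finished game. The squashing function, learning rate, and update rule here are illustrative choices, not taken from any real engine:

    import math

    LEARNING_RATE = 0.01  # illustrative step size

    def predict(board):
        # Squash the raw material score into (-1, 1) so it is comparable
        # to a game outcome of -1 (loss), 0 (draw), or +1 (win).
        return math.tanh(evaluate(board))

    def update(board, outcome):
        """One gradient-flavored step on the piece values for one position."""
        error = predict(board) - outcome
        for piece in board:
            color, kind = piece[0], piece[1]
            if kind not in piece_values:
                continue
            sign = 1.0 if color == "w" else -1.0
            # Pieces the winner had get valued up; the loser's get valued down.
            piece_values[kind] -= LEARNING_RATE * error * sign

    # Example: White had the queen but lost this game anyway.
    update(["wQ", "bR", "bN"], -1)
    print(piece_values)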

From the paper on AlphaZero (page 3):

The parameters of the deep neural network in AlphaZero are trained by self-play reinforcement learning, starting from randomly initialised parameters. Games are played by selecting moves for both players by MCTS. At the end of the game, the terminal position is scored according to the rules of the game to compute the game outcome: −1 for a loss, 0 for a draw, and +1 for a win. The neural network parameters are updated so as to minimise the error between the predicted outcome and the game outcome, and to maximise the similarity of the policy vector to the search probabilities.

[math symbols removed from quote]

In summary, during training, AlphaZero played games against itself. When a game was over, the result of the game and the accuracy of its predictions about how the game would proceed were used to adjust the neural net so that it would be more accurate during the next game. AlphaZero is not keeping a record of every position it has seen; it is adjusting itself so that it can more accurately evaluate any board it sees in the future.
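
In code, the objective described in that quote might look roughly like the sketch below. The regularisation term and constants from the paper are omitted, and the function name and toy numbers are made up for illustration:

    import numpy as np

    def alphazero_loss(v, z, p, pi):
        """v: predicted outcome, z: actual game outcome (-1, 0, or +1),
        p: predicted move probabilities, pi: MCTS visit-count probabilities."""
        value_loss = (z - v) ** 2                     # outcome prediction error
        policy_loss = -np.sum(pi * np.log(p + 1e-8))  # cross-entropy: large when
                                                      # p disagrees with pi
        return value_loss + policy_loss

    # Example: the net predicted a slight edge (v = 0.2) but lost (z = -1),
    # and its move preferences disagreed with the search probabilities.
    p  = np.array([0.70, 0.20, 0.10])
    pi = np.array([0.10, 0.60, 0.30])
    print(alphazero_loss(0.2, -1, p, pi))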

Mark H
  • I completely understand your explanation at the algorithmic level, but I am still astounded that it works. I would have thought that the early games would be so random that they would have no learning value. It seems impossible to evaluate the outcome of a move at that stage except by playing it out to checkmate, because that is the only thing you've been told about. But that checkmate will only happen after a large amount of other essentially random stuff has gone on. My gut feeling is that there just is not sufficient meaningful data to draw any conclusions. Why am I wrong? – Philip Roe Dec 12 '17 at 16:53
  • @PhilipRoe You're right, each game only provides a little bit of information. I've actually written my own chess engine that learns by an evolutionary algorithm: randomly modified copies of the engine play each other; the losers are deleted and the winners produce more modified copies (see the toy sketch after these comments). It usually takes between 10,000 and 20,000 games for it to figure out just the proper order of piece values (queen, rook, bishop/knight, pawn). It took AlphaZero 44 million games to achieve its skill (table on page 15 of the linked paper). – Mark H Dec 12 '17 at 20:12
  • Thanks for responding! But I'm still astounded. There is the huge space of possible positions to evaluate. But there is also the huge space of possible questions to ask. Anthropomorphically, I imagine myself with zero prior knowledge except the rules, and a huge database of games played at an almost inconceivable level of incompetence (although I don't suppose all of them get remembered). At what point does it occur to me, "Hey, maybe I should count the pieces"? Then how long before counting the pieces seems to be a good idea? – Philip Roe Dec 12 '17 at 20:36
  • I find it very hard to imagine, even if some strong hints were provided about "What constitutes a good question?" But even without that, I'm impressed that a hierarchy of pieces can be established in 20,000 games. So I find it very hard to accept that the tabula is really rasa. Some minimal instruction about the process of generating and revising your rules (how many, how often?) still seems essential. – Philip Roe Dec 12 '17 at 20:46
  • @PhilipRoe In my program, I tell the engine to count the pieces, but not how much each piece is worth. So, I do tell the engine what to look at, but not how to weight what it sees. AlphaZero is much more tabula rasa. If you're curious: https://github.com/MarkZH/Genetic_Chess – Mark H Dec 12 '17 at 21:28
  • And in those 44 million games, do we have any idea how many total positions it looked at? I figure the learning could be made more efficient if, e.g. it was tuned to always find forced mates in less than 5 moves. – Roy Koczela Dec 12 '17 at 21:40
  • Right - it would not qualify as totally self-trained if you told it that one of the key parameters was the value of the pieces. It has to "figure that out" itself - and it does seem like 44 million games would be way too few to do that - although again, I suspect that those "44 million" games included a whole lot of evaluations of "moves" that didn't get "played". I note the paper says that "thinking time" was 40 ms during evaluation. – Roy Koczela Dec 12 '17 at 22:07
  • @PhilipRoe "What constitutes a good question?" is a good question. One of the annoying things about the study of neural nets is that it's still very difficult to determine how the trained net makes decisions. Until these things learn to talk, studying the trained net is as useful as opening up a human chess player's skull to see what they are thinking. – Mark H Dec 12 '17 at 22:54
  • @RoyKoczela Given about a hundred ply per game plus multiple variations when choosing moves, I wouldn't be surprised if the number of positions seen during training was on the order of 10 or 100 billion. This is an estimate; I'm sure there are papers (probably on arXiv) by the AlphaZero team that go into detail. – Mark H Dec 12 '17 at 23:01
  • @Mark, Roy I'm finding this fascinating! Not least for the broader implications. How long has it taken (by which I mean how many generations) for natural selection to produce us, who on a good day might manage a reasonable game of chess, among our other talents? But I'm not sure how close the parallel is. Nature gets to start with something very simple that has already evolved for some purpose. She wants to make it either a bit better for that purpose, or to slightly redirect the purpose. But AlphaZero is not initially good at anything. – Philip Roe Dec 13 '17 at 03:05
  • On the other hand, I suppose, it does know what it wants to become good at. I suspect the moderators would like us to move to chat. I would be OK with that? – Philip Roe Dec 13 '17 at 03:07
  • @PhilipRoe The problem with natural selection is that the game of chess plays no part in which organisms live, die, or reproduce (*The Seventh Seal* notwithstanding). Plus, AlphaZero is not a genetic or evolutionary algorithm. It is a neural net that mimics a brain that learns by making and breaking connections between simple elements by comparing what it currently thinks to what actually happens on the board. It is a much more active and direct process than evolution. – Mark H Dec 13 '17 at 04:22
  • So 100 billion... 10^11 positions. So for every position it looked at, there are about 10^35 (a 1 with 35 zeroes) positions that it didn't. And yet, apparently, that was good enough. It makes me wonder how good it would be if it had to play 960 without re-training. – Roy Koczela Dec 13 '17 at 18:55
  • We now have some evidence due to Allie (a Leela derivative that can now play 960). It's about 100 Elo worse, mainly due to being very confused by castling in 960. – Oscar Smith Jun 06 '19 at 07:30
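
For the curious, here is a toy version of the evolutionary loop Mark H describes in the comments above. A real engine would determine fitness by playing actual games; here, distance to some assumed "true" piece values stands in for that, so the whole thing is illustrative only:

    import random

    # Stand-in target: pawn, knight, bishop, rook, queen.
    TRUE_VALUES = [1, 3, 3, 5, 9]

    def mutate(values):
        # Randomly modified copy of an engine's parameters.
        return [max(0.0, v + random.gauss(0, 0.5)) for v in values]

    def fitness(values):
        # Fake tournament result: closer to the true values means a
        # stronger engine in this toy setup.
        return -sum((a - b) ** 2 for a, b in zip(values, TRUE_VALUES))

    # Start with 20 engines holding random piece values.
    population = [[random.uniform(0, 10) for _ in range(5)] for _ in range(20)]

    for generation in range(1000):
        population.sort(key=fitness, reverse=True)
        winners = population[:10]  # losers are deleted
        population = winners + [mutate(random.choice(winners)) for _ in range(10)]

    print(max(population, key=fitness))  # should approach [1, 3, 3, 5, 9]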

Here, it is crucial that AlphaZero is based on a deep neural network (DNN), which essentially means that the network consists of many layers of, mostly, different types (regarding how the information is collected and processed). This allows such networks to subdivide the evaluation problem into a large (and flexible) number of elements it can process separately and then combine the results into a final answer.

If the network were too flat (a standard NN with a small number of layers), it would simply smoothly extrapolate the knowledge it has learned from the training examples. While this has worked fine (for over 30 years already) in simple tasks like recognition of hand-written letters, it fails in more subtle problems where many details must be put together in a complex way. Think of animal recognition: sometimes you have a large cat positioned frontally but in bad light or shadow, sometimes a small cat rolling on its back, and so on. You must learn which elements are crucial for deciding that "it is a cat".

This requires subdividing the problem into many small pieces and subsequently gathering the pieces of information to reach a conclusion, and this process (division, gathering) can be repeated several times. In DNNs, both of these functions can be achieved both automatically, by suitable learning algorithms, and manually, by the designer's choice of special types of layers and their connections. It is the power of DNNs that they can organize their information flow very flexibly, as it is usually not possible to define a priori which subtasks should be performed.

AlphaZero does the same: it analyses not only the full position itself, of which there are so many possibilities that it would never learn to reach proper conclusions by simply extrapolating its limited knowledge. It also analyses subgroups of pieces and local configurations, and learns to recognize features like pins, forks, etc. Then it combines plenty of such pieces of information to obtain a global picture and find the total position estimate. The way AZ does this is pretty flexible and dynamic.
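
As a very rough illustration of that layered flow (not AlphaZero's actual architecture), a board can be encoded as one 8x8 plane per piece type, a first layer can scan small local windows, and a final layer can gather those local responses into one number. The random weights below stand in for whatever training would learn:

    import numpy as np

    rng = np.random.default_rng(0)

    def conv3x3(plane, kernel):
        """Slide a 3x3 kernel over an 8x8 plane (no padding), giving a
        6x6 map of local-pattern responses."""
        out = np.zeros((6, 6))
        for r in range(6):
            for c in range(6):
                out[r, c] = np.sum(plane[r:r+3, c:c+3] * kernel)
        return out

    # A board as 12 binary planes (one per piece type and color);
    # random here, just to have something to feed through.
    board_planes = rng.integers(0, 2, size=(12, 8, 8)).astype(float)

    # Layer 1: detect local configurations on each plane separately.
    kernels = rng.normal(size=(12, 3, 3))
    local_features = np.array([conv3x3(p, k) for p, k in zip(board_planes, kernels)])

    # Layer 2: gather all the local findings into one position estimate.
    readout = rng.normal(size=local_features.size)
    value = np.tanh(local_features.ravel() @ readout)
    print(value)  # one number for a board the net has never seen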

Nikodem