[From the Stanford link above]
Unsurprisingly, there's a neural network at the core of things. The neural network $f_\theta$ is parameterised by $\theta$ and takes as input the state $s$ of the board. It has two outputs: a continuous value $v_\theta(s) \in [-1, 1]$ of the board state from the perspective of the current player, and a policy $\vec{p}_\theta(s)$ that is a probability vector over all possible actions.
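To make the two outputs concrete, here's a minimal sketch of such a two-headed network. I'm assuming PyTorch; the class name, layer sizes, and action count below are illustrative choices, not the architecture used in the paper or in my experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadedNet(nn.Module):
    """Illustrative two-headed network: a shared convolutional trunk feeding a
    policy head and a value head. All sizes here are arbitrary placeholders."""

    def __init__(self, board_size=6, n_actions=36, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        flat = channels * board_size * board_size
        self.policy_head = nn.Linear(flat, n_actions)
        self.value_head = nn.Linear(flat, 1)

    def forward(self, board):
        # board: (batch, 1, board_size, board_size)
        x = F.relu(self.conv1(board))
        x = F.relu(self.conv2(x))
        x = x.flatten(start_dim=1)
        p = F.softmax(self.policy_head(x), dim=1)       # p_theta(s): probabilities over actions
        v = torch.tanh(self.value_head(x)).squeeze(-1)  # v_theta(s) in [-1, 1]
        return p, v
```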
When training the network, at the end of each game of self-play, the neural network is provided training examples of the form $(s_t, \vec{\pi}_t, z_t)$. Here $\vec{\pi}_t$ is an estimate of the policy from state $s_t$ (we'll get to how $\vec{\pi}_t$ is arrived at in the next section), and $z_t \in \{-1, 1\}$ is the final outcome of the game from the perspective of the player at $s_t$ (+1 if the player wins, -1 if the player loses). The neural network is then trained to minimise the following loss function (excluding regularisation terms):

$$l = \sum_t \left( v_\theta(s_t) - z_t \right)^2 - \vec{\pi}_t \cdot \log\left(\vec{p}_\theta(s_t)\right)$$

The underlying idea is that over time, the network will learn what states eventually lead to wins (or losses). In addition, learning the policy gives a good estimate of the best action from a given state. The neural network architecture in general depends on the game. Most board games such as Go can use a multi-layer CNN architecture. In the paper by DeepMind, they use 20 residual blocks, each with 2 convolutional layers. I was able to get a 4-layer CNN followed by a few feedforward layers to work for 6x6 Othello.
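Written out as code, the loss above might look like the following. This is only a sketch; the function name and tensor shapes are assumptions made for illustration.

```python
import torch

def alphazero_loss(v_pred, z, pi, p_pred, eps=1e-8):
    """l = sum_t (v_theta(s_t) - z_t)^2 - pi_t . log(p_theta(s_t)),
    with regularisation terms omitted, as in the text above.

    v_pred: (batch,)            predicted values v_theta(s_t)
    z:      (batch,)            game outcomes z_t in {-1, +1}
    pi:     (batch, n_actions)  MCTS-improved policies pi_t
    p_pred: (batch, n_actions)  predicted policies p_theta(s_t)
    """
    value_loss = torch.sum((v_pred - z) ** 2)
    policy_loss = -torch.sum(pi * torch.log(p_pred + eps))  # eps avoids log(0)
    return value_loss + policy_loss
```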
Monte Carlo Tree Search for Policy Improvement
Given a state $s$, the neural network provides an estimate of the policy $\vec{p}_\theta(s)$. During the training phase, we wish to improve these estimates. This is accomplished using a Monte Carlo Tree Search (MCTS). In the search tree, each node represents a board configuration. A directed edge $i \to j$ exists between two nodes if a valid action can cause a state transition from state $i$ to $j$. Starting with an empty search tree, we expand the search tree one node (state) at a time. When a new node is encountered, instead of performing a rollout, the value of the new node is obtained from the neural network itself. This value is propagated up the search path. Let's sketch this out in more detail.
For the tree search, we maintain the following:
- $Q(s, a)$: the expected reward for taking action $a$ from state $s$, i.e. the Q values
- $N(s, a)$: the number of times we took action $a$ from state $s$ across simulations
- $P(s, \cdot) = \vec{p}_\theta(s)$: the initial estimate of taking an action from the state $s$ according to the policy returned by the current neural network.
From these, we can calculate $U(s, a)$, the upper confidence bound on the Q-values, as

$$U(s, a) = Q(s, a) + c_{\mathrm{puct}} \cdot P(s, a) \cdot \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$$

Here $c_{\mathrm{puct}}$ is a hyperparameter that controls the degree of exploration. To use MCTS to improve the initial policy returned by the current neural network, we initialise our empty search tree with $s$ as the root. A single simulation proceeds as follows. We compute the action $a$ that maximises the upper confidence bound $U(s, a)$. If the next state $s'$ (obtained by playing action $a$ on state $s$) exists in our tree, we recursively call the search on $s'$. If it does not exist, we add the new state to our tree and initialise $P(s', \cdot) = \vec{p}_\theta(s')$ and the value $v(s') = v_\theta(s')$ from the neural network, and initialise $Q(s', a)$ and $N(s', a)$ to 0 for all $a$. Instead of performing a rollout, we then propagate $v(s')$ up along the path seen in the current simulation and update all $Q(s, a)$ values. On the other hand, if we encounter a terminal state, we propagate the actual reward (+1 if the player wins, else -1).
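As a rough illustration of the description above, here is one way a single simulation could be written in Python. The `game`/`net` interface, the dictionary names, and the sign flips (which assume a two-player, zero-sum game where the value for one player is the negative of the value for the other) are all assumptions made for this sketch, not the implementation referred to at the end of this section.

```python
import math

# Tree statistics, keyed by a hashable state s (e.g. a board string) and action a.
Qsa, Nsa = {}, {}   # Q(s, a) and N(s, a)
Ps, Ns = {}, {}     # P(s, .) from the network, and total visit count of s
c_puct = 1.0        # exploration hyperparameter

def search(s, game, net):
    """One simulation: descend by maximising U(s, a), expand a leaf with the
    network instead of a rollout, and propagate the value back up the path.
    Returns the value of s from the perspective of the player who just moved."""
    if game.is_terminal(s):
        # Actual reward (+1 / -1) from the perspective of the player to move at s;
        # negate it for the parent, who is the opponent.
        return -game.reward(s)

    if s not in Ps:
        # New leaf: initialise P(s, .) = p_theta(s) and use v_theta(s) as the value.
        Ps[s], v = net.predict(s)
        Ns[s] = 0
        return -v

    # Pick the action maximising the upper confidence bound U(s, a).
    best_u, best_a = -float("inf"), None
    for a in game.valid_actions(s):
        q = Qsa.get((s, a), 0)
        n = Nsa.get((s, a), 0)
        u = q + c_puct * Ps[s][a] * math.sqrt(Ns[s]) / (1 + n)
        if u > best_u:
            best_u, best_a = u, a

    s_next = game.next_state(s, best_a)
    v = search(s_next, game, net)   # recurse; the returned value is from the current player's view

    # Update Q(s, a) as the running average of the values seen through this edge.
    q, n = Qsa.get((s, best_a), 0), Nsa.get((s, best_a), 0)
    Qsa[(s, best_a)] = (n * q + v) / (n + 1)
    Nsa[(s, best_a)] = n + 1
    Ns[s] += 1
    return -v
```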
After a few simulations, the $N(s, a)$ values at the root provide a better approximation for the policy. The improved stochastic policy $\vec{\pi}(s)$ is simply the normalised visit counts $N(s, \cdot) / \sum_b N(s, b)$. During self-play, we perform MCTS and pick a move by sampling from the improved policy $\vec{\pi}(s)$.
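This step is simple enough to sketch directly. The function below assumes the same dictionary layout as the search sketch above; its name and arguments are made up for illustration.

```python
import numpy as np

def improved_policy_and_move(root, actions, Nsa, rng=None):
    """Build pi(s) = N(s, .) / sum_b N(s, b) from the root visit counts and
    sample a move from it, as done during self-play."""
    if rng is None:
        rng = np.random.default_rng()
    counts = np.array([Nsa.get((root, a), 0) for a in actions], dtype=float)
    pi = counts / counts.sum()          # normalised visit counts
    move = rng.choice(actions, p=pi)    # sample a move from the improved policy
    return pi, move
```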
Below is a high-level implementation of one simulation of the search algorithm.