AlphaZero and Monte Carlo Tree Search, explained from scratch

This page explains the search algorithm at the heart of this project (https://github.com/DavidBarbera/alphazero-simple). It assumes no background in reinforcement learning — if you can read a little Python and you know the rules of tic-tac-toe, you can follow along.

It covers three things:

the idea of Monte Carlo Tree Search (MCTS) and why games need it,
how the implementation in alphazero/mcts.py works, line by line, and
how AlphaZero (Silver et al., Science 2018) turns this classic algorithm into something superhuman.

TL;DR. To decide on a move, MCTS plays out many imaginary continuations of the game, keeps statistics on how each one tends to turn out, and then picks the move it tried most often. It spends its effort on promising lines instead of examining every possibility. AlphaZero keeps this exact loop but replaces two hand-made pieces — which moves to try and how to judge a position — with a neural network.

1. Why not just look at every move?

The obvious way to play a game perfectly is to look at every move, every reply to every move, and so on to the end, then pick the line that wins. This is called searching the game tree: the root is the current position, and each branch is a possible move.

The problem is size.

How big a game tree is, and how little of it MCTS builds

Tic-tac-toe has only about 10⁵ possible games — small enough to brute-force. But the same approach is hopeless for chess (~10¹²⁰). You cannot look at everything, so you must be selective. MCTS is one principled way to be selective: it grows the tree outward along the branches that look most worthwhile, guided by what it has learned from earlier playouts.

2. The one-paragraph idea

MCTS builds its tree one simulation at a time. Each simulation walks from the root down into the tree, adds a new position, guesses how good that position is by playing the rest of the game at random, and then updates the statistics of every position it passed through. After a fixed budget of simulations (say a few hundred), the move that was visited most often is played. The more a move is visited, the more the search believed it was worth investigating — so visit count is a remarkably good measure of move quality.

3. The loop: four phases

Every simulation is the same four steps, repeated:

flowchart LR
    A["1 · Select<br/>follow UCT down to a leaf"] --> B["2 · Expand<br/>add one child"]
    B --> C["3 · Simulate<br/>random playout to the end"]
    C --> D["4 · Backup<br/>update visits N and value Q"]
    D -->|repeat for the whole budget| A
    D -.->|when the budget is spent| E["Play the<br/>most-visited move"]

Here is what each phase does to the same tree as it grows:

The four phases of one MCTS simulation

Select — start at the root and repeatedly step into the most promising child until you reach a position that hasn’t been fully explored yet. “Most promising” is defined by the UCT rule in the next section.
Expand — add one new child node for a move that hasn’t been tried from that position.
Simulate (also called a rollout or playout) — from the new node, play the game to the end choosing random legal moves for both sides, and see who wins. This is a cheap, noisy estimate of how good the new position is.
Backup — walk back up to the root, adding the result to every node on the path so their averages reflect what just happened.

4. Selection: explore or exploit?

At each step down the tree, selection faces a dilemma familiar to anyone choosing a restaurant: do you go back to the place you already know is good (exploit), or try somewhere new that might be even better (explore)? Always exploiting means you might miss a great move you only tried once and got unlucky with. Always exploring wastes simulations on bad moves.

MCTS resolves this with the UCT formula (Upper Confidence bounds applied to Trees). For each candidate move $a$ from position $s$ , it computes a score:

\text{UCT}(s, a) \;=\; \underbrace{Q(s, a)}_{\text{exploitation}} \;+\; \underbrace{c \sqrt{\frac{\ln N(s)}{N(s, a)}}}_{\text{exploration}}

and picks the move with the highest score, where

$Q(s, a)$ is the average result of simulations that went through move $a$ (between $-1$ for “we always lose” and $+1$ for “we always win”),
$N(s, a)$ is how many times move $a$ has been tried,
$N(s)$ is how many times the parent position has been visited, and
$c$ is a knob controlling how much exploration matters (this repo uses $c \approx \sqrt{2}$ ).

The second term is large when a move has been tried few times relative to its siblings, and it shrinks as that move accumulates visits. So a move gets attention either because it has a high average value, or because it is under-explored:

How UCT trades off a move's value against how little it has been tried

Move B is chosen here even though its average value is lower, because it has barely been tried — its exploration bonus is large. As B accumulates visits, that bonus shrinks, and selection naturally shifts toward whichever move truly has the higher value.

In the code, this is _best_uct_child. One subtlety: every node stores its value from the point of view of the player about to move there. A child node is the opponent’s turn, so what’s good for the child is bad for us — hence the -child.Q:

def _best_uct_child(self, node):
    log_parent = math.log(node.N)
    best_child, best_score = None, -math.inf
    for child in node.children.values():
        exploit = -child.Q          # child.Q is from the opponent's POV
        explore = self.c * math.sqrt(log_parent / child.N)
        score = exploit + explore
        if score > best_score:
            best_score, best_child = score, child
    return best_child

5. Simulation: a quick, random guess

Once selection and expansion reach a fresh node, MCTS needs a value for it — but the game isn’t over yet. Classic MCTS answers with a rollout: just play random moves for both players until the game ends, and use that outcome.

A random playout from a leaf to a finished game

A single random game is a noisy guess. The magic is in averaging: across hundreds of simulations, the random noise washes out and the averages at each node converge toward the true value of the position. In code this is _rollout:

def _rollout(self, board, to_play):
    """Play random moves to the end; return the result from to_play's POV."""
    b, p = board, to_play
    while True:
        result = self.game.get_game_ended(b, to_play)
        if result != 0:
            return float(result)           # +1 win / -1 loss / ~0 draw
        valid = self._valid_actions(b, p)
        action = int(self.rng.choice(valid))
        b, p = self.game.get_next_state(b, p, action)

Random rollouts are the weakest part of classic MCTS, and exactly what AlphaZero throws away — see section 8.

The rollout produced one number. Backup adds it to every node on the path from the leaf up to the root, so each node’s running average reflects the new evidence.

The one tricky part is the sign. Because each node is “owned” by the player to move there, and players alternate, a result that is good for one player is bad for the next. So the value flips sign at every level on the way up:

Why the backed-up value flips sign at each level of the tree

In code, _backup handles this by comparing each node’s player to the leaf’s player:

def _backup(self, leaf, value):
    leaf_player = leaf.to_play
    node = leaf
    while node is not None:
        node.N += 1
        # same player as the leaf -> add value; opponent -> subtract it
        node.W += value if node.to_play == leaf_player else -value
        node = node.parent

This negamax-style sign handling is the single most common place to introduce a bug, which is why the tests in tests/test_mcts.py check that the search takes a winning move, blocks a losing one, and never loses to a random opponent.

7. Choosing the move to play

After the simulation budget is spent, the root’s children carry visit counts. The search’s recommendation is the distribution of those visits,

\pi(a \mid s_\text{root}) \;\propto\; N(s_\text{root}, a),

and for actual play we simply take the most-visited move. (In Stage 4, when we train AlphaZero, this whole distribution $\pi$ becomes the target the policy network learns to imitate.) That is the last line of search:

return max(root.children.items(), key=lambda kv: (kv[1].N, kv[1].Q))[0]

8. How AlphaZero supercharges this

Everything above is classic MCTS — it knows nothing but the rules and guesses by playing random games. AlphaZero keeps the identical four-phase loop but replaces its two hand-made pieces with a single neural network $f_\theta(s) = (\mathbf{p}, v)$ that outputs a policy $\mathbf{p}$ (a prior over moves) and a value $v$ (an estimate of who is winning):

What AlphaZero replaces in the search

Change 1 — selection uses a learned prior (PUCT). Instead of treating untried moves uniformly, AlphaZero weights the exploration term by the network’s prior $P(s, a)$ , so search concentrates on moves the network already thinks are good:

a^\star \;=\; \arg\max_a \; Q(s, a) \;+\; c_\text{puct}\, P(s, a)\, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}

Change 2 — evaluation uses the value head, not a rollout. The expensive random playout to the end is replaced by a single call to the value head $v$ . No dice are rolled; the network simply judges the position.

The two changes feed each other: a good prior means fewer wasted simulations, and a good value estimate means each simulation is informative. The paper reports that AlphaZero examined only about 60,000 positions per second, versus roughly 60 million for the traditional chess engine Stockfish — a far smaller, more selective search that nonetheless played stronger, an approach the authors liken to the more human-like search first imagined by Claude Shannon.

	Vanilla MCTS (this repo, Stage 2)	AlphaZero (Stage 3)
Which moves to explore	UCT, all untried moves equal	PUCT, weighted by policy prior $P(s,a)$
How to evaluate a leaf	random rollout to the end	one value-head call $v$
Selection rule	$Q + c\sqrt{\ln N / n}$	$Q + c_\text{puct}\,P\,\sqrt{\sum N}/(1+n)$
Domain knowledge used	only the rules	only the rules
Where the strength comes from	sheer number of playouts	a network trained by self-play

This is precisely the jump we make in Stage 3: build the network, swap the rollout for its value head, and add the prior to selection. The search code barely changes — which is the whole point of writing it cleanly now.

9. How it maps to the code

The implementation is one small class plus a private node type:

classDiagram
    class MCTS {
        +num_simulations
        +c : exploration constant
        +search(board, to_play) int
        -_select(node) Node
        -_expand(node) Node
        -_best_uct_child(node) Node
        -_rollout(board, to_play) float
        -_backup(leaf, value)
    }
    class Node {
        +to_play : +1 / -1
        +N : visit count
        +W : total value
        +Q : W / N
        +children
        +untried
    }
    MCTS --> Node : builds a tree of

Concept in this page	Where it lives in `mcts.py`
One simulation	one iteration of the loop in `search`
Phase 1 · Select	`_select` + `_best_uct_child`
Phase 2 · Expand	`_expand`
Phase 3 · Simulate	`_rollout`
Phase 4 · Backup	`_backup`
Per-position statistics $N$ , $W$ , $Q$	the `_Node` fields
Search policy $\pi$ (visit counts)	`MCTS.root_visit_counts`
The move actually played	the `max(...)` at the end of `search`

And the loop itself, which reads almost exactly like the four phases:

def search(self, board, to_play):
    root = _Node(board, to_play, None, self._valid_actions(board, to_play))
    for _ in range(self.num_simulations):
        leaf  = self._select(root)               # phases 1 + 2
        value = self._rollout(leaf.board, leaf.to_play)  # phase 3
        self._backup(leaf, value)                # phase 4
    self.root_visit_counts = {a: c.N for a, c in root.children.items()}
    return max(root.children.items(), key=lambda kv: (kv[1].N, kv[1].Q))[0]

10. Try it yourself

The search is wired into the web UI as the “vs MCTS” opponent — see the main README for how to run it. Playing against it, two things stand out and both motivate Stage 3:

it is effectively unbeatable at tic-tac-toe, but
it “thinks” for hundreds of simulations per move and has no memory — it rebuilds the tree from scratch every turn and judges positions by random playouts.

Stage 3 fixes both by giving the search a network that has learned to evaluate positions and suggest moves, so the same search gets far stronger with far fewer simulations.

References

D. Silver et al., “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play,” Science 362, 1140–1144 (2018). DOI
L. Kocsis and C. Szepesvári, “Bandit based Monte-Carlo Planning,” ECML (2006) — the original UCT algorithm.
C. Browne et al., “A Survey of Monte Carlo Tree Search Methods,” IEEE TCIAIG (2012) — a thorough survey of MCTS variants.

The diagrams on this page are generated by docs/figures.py (pure Python, no dependencies). Run it to reproduce or tweak them.