AlphaZero from scratch · David S. Barbera

A small, readable implementation of AlphaZero (Silver et al., Science 2018) built up one stage at a time, plus a polished web UI for actually playing the games. We start with tic-tac-toe: the algorithm is identical across games, but tic-tac-toe trains in minutes on a laptop. We then reuse the same code (and the same UI) for harder games.

The big idea in three sentences

One neural network looks at a board and outputs a policy (which moves look good) and a value (who is likely to win).
A Monte Carlo Tree Search uses that network to look ahead and produce a better policy than the raw network.
The network trains on its own search results from self-play, bootstrapping from random play to superhuman strength with no external data beyond the rules of the game.
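
To make the three sentences concrete, here is a minimal sketch of how the pieces fit together for tic-tac-toe. It is not this repository's code: every name in it (`network`, `Node`, `simulate`, `search`, `self_play_game`, `c_puct`) is made up for illustration, and the "network" is a stand-in that returns a uniform policy and a value of 0 so the example runs without any ML library. The search uses the PUCT selection rule from the AlphaZero paper in a simplified form (no Dirichlet noise, no temperature).

```python
import math
import random

# Board: tuple of 9 cells, each 0 (empty), +1 or -1. The player to move is +1 or -1.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == 0]

def play(board, move, player):
    cells = list(board)
    cells[move] = player
    return tuple(cells)

def winner(board):
    """+1 or -1 if that player has three in a row, 0 for a draw, None if unfinished."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0 if 0 not in board else None

def network(board, player):
    """Sentence 1 (stand-in): board -> (policy over legal moves, value for `player`)."""
    moves = legal_moves(board)
    return {m: 1.0 / len(moves) for m in moves}, 0.0

class Node:
    def __init__(self, prior):
        self.prior = prior      # policy probability assigned by the network
        self.visits = 0
        self.value_sum = 0.0    # summed from the view of the player who moved into this node
        self.children = {}      # move -> Node

def simulate(node, board, player, c_puct=1.5):
    """Run one simulation; return the value from the view of `player` (to move at `board`)."""
    result = winner(board)
    if result is not None:
        value = result * player                  # finished game: exact outcome
    elif not node.children:
        policy, value = network(board, player)   # leaf: expand using the network
        for move, prior in policy.items():
            node.children[move] = Node(prior)
    else:
        sqrt_n = math.sqrt(node.visits)
        def puct(item):
            _, child = item
            q = child.value_sum / child.visits if child.visits else 0.0
            return q + c_puct * child.prior * sqrt_n / (1 + child.visits)
        move, child = max(node.children.items(), key=puct)
        # The child's value is from the opponent's view, so flip the sign.
        value = -simulate(child, play(board, move, player), -player, c_puct)
    node.visits += 1
    node.value_sum -= value   # store from the view of the player who chose this node
    return value

def search(board, player, simulations=200):
    """Sentence 2: use the network inside a tree search to get a better policy."""
    root = Node(prior=1.0)
    for _ in range(simulations):
        simulate(root, board, player)
    total = sum(child.visits for child in root.children.values())
    return {m: child.visits / total for m, child in root.children.items()}

def self_play_game(simulations=50, rng=random):
    """Sentence 3: play against yourself and record (board, player, search policy, outcome)."""
    board, player, history = (0,) * 9, 1, []
    while winner(board) is None:
        pi = search(board, player, simulations)
        history.append((board, player, pi))
        moves, weights = zip(*pi.items())
        board = play(board, rng.choices(moves, weights=weights)[0], player)
        player = -player
    z = winner(board)
    # The outcome is stored from the view of the player to move at each position.
    return [(b, p, pi, z * p) for b, p, pi in history]
```

Even with the uninformative stand-in network, `search((1, 1, 0, -1, -1, 0, 0, 0, 0), player=1)` puts most of its visits on square 2, the immediate win, which is the sense in which search "produces a better policy than the raw network". In a real training loop the tuples returned by `self_play_game` would become the network's targets: the search policy for the policy output and the game outcome for the value output.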