All code here can be found in this repository.
First of all, I'm sorry for not posting much this summer. I've been pretty wrapped up in some cool stuff.
Super Tic Tac Toe is a variant of the classic game of tic tac toe. Presh Talwalkar explains it well here. My goal with this short post is to design (but not run) a reinforcement learning procedure so that a Deep Q-Network (DQN) can learn the optimal strategy for Super Tic Tac Toe. The structure of this post is as follows:
Note: I will not be training the model. I'm just exercising my understanding of reinforcement learning.
Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by performing actions in an environment to maximize some notion of cumulative reward. The key components of an RL system are:
- the agent, which selects actions;
- the environment, which the agent interacts with;
- the state \( s \), a description of the environment at a given time;
- the action \( a \), a choice made by the agent;
- the reward \( r \), a scalar signal returned by the environment after each action;
- the policy \( \pi \), the agent's mapping from states to actions.
The objective is to learn a policy \(\pi\) that maximizes the expected cumulative reward over time, often defined as the return \( R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \), where \( \gamma \) is the discount factor.
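To make the return concrete, here is a minimal sketch in plain Python (the reward list and `gamma` value are assumed inputs, not anything fixed by this post) that computes \( R_t \) for a finite episode:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k} for a finite list of rewards
    starting at time t."""
    R = 0.0
    # Iterate backwards so each step folds in the discounted future return.
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Example: three neutral moves followed by a win (+1), gamma = 0.9
print(discounted_return([0, 0, 0, 1], gamma=0.9))  # 0.9**3 = 0.729
```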
Super Tic Tac Toe is played on a \(3 \times 3\) grid where each cell itself is a \(3 \times 3\) Tic Tac Toe board. Let's denote:
- \( L_{ij} \): the small \(3 \times 3\) board in row \( i \), column \( j \) of the large grid, for \( i, j \in \{0, 1, 2\} \).
The state \( s \) can be represented as a flattened array of size 81:
\[ s = \text{flatten}(\{L_{ij}\}_{i,j=0}^2) \]
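As a concrete sketch (the cell encoding below is my own assumption, not something fixed by the post): represent each cell as 0 for empty, 1 for the agent's mark, and -1 for the opponent's, and flatten the nine small boards into a single length-81 vector:

```python
import numpy as np

EMPTY, ME, OPP = 0, 1, -1  # assumed cell encoding

def make_state(boards):
    """`boards` is a 3x3 array of 3x3 small boards, i.e. shape (3, 3, 3, 3).
    Returns the flattened length-81 state vector s."""
    boards = np.asarray(boards, dtype=np.float32)
    assert boards.shape == (3, 3, 3, 3)
    return boards.reshape(81)

# Example: an empty game, then the agent plays the center cell of the center board.
boards = np.zeros((3, 3, 3, 3), dtype=np.float32)
boards[1, 1, 1, 1] = ME
s = make_state(boards)
print(s.shape)  # (81,)
```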
The action \( a \) is choosing a cell within the designated small board. For simplicity, actions can be encoded as a single integer in the range \([0, 80]\), indexing all 81 cells of the full grid.
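One way to realize this encoding (a sketch, assuming the flat index simply enumerates small boards first and then cells within each board) is:

```python
def encode_action(big_row, big_col, small_row, small_col):
    """Map (big-board cell, small-board cell) to a flat index in [0, 80]."""
    board = 3 * big_row + big_col      # which small board, in [0, 8]
    cell = 3 * small_row + small_col   # which cell inside it, in [0, 8]
    return 9 * board + cell

def decode_action(a):
    """Inverse of encode_action."""
    board, cell = divmod(a, 9)
    return board // 3, board % 3, cell // 3, cell % 3

assert decode_action(encode_action(2, 1, 0, 2)) == (2, 1, 0, 2)
```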
Q-learning is a value-based RL method that seeks to learn the optimal action-selection policy by iteratively updating the action-value function:
\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]
where:
- \( \alpha \) is the learning rate,
- \( r \) is the reward received after taking action \( a \) in state \( s \),
- \( \gamma \) is the discount factor,
- \( s' \) is the next state, and
- \( a' \) ranges over the actions available in \( s' \).
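In tabular form (before introducing a network), the update is a one-liner. Here is a sketch assuming a dictionary-backed Q-table and hypothetical hyperparameters `alpha` and `gamma`; states are assumed to be hashable (e.g. tuples):

```python
from collections import defaultdict

Q = defaultdict(float)    # Q[(state, action)] -> value, defaults to 0
alpha, gamma = 0.1, 0.99  # assumed learning rate and discount factor

def q_update(s, a, r, s_next, legal_actions):
    """One Q-learning step: move Q(s, a) toward the TD target
    r + gamma * max_{a'} Q(s', a')."""
    best_next = max((Q[(s_next, a2)] for a2 in legal_actions), default=0.0)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```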
We use a neural network to approximate the Q-value function \( Q(s, a; \theta) \), where \( \theta \) are the weights of the network. The loss function for training the network is:
\[ L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right] \]
where \( \theta^- \) are the weights of the target network, which are periodically updated to stabilize training.
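A minimal PyTorch sketch of this loss, assuming an 81-dimensional state, 81 possible actions, and a replay-buffer batch of \((s, a, r, s', \text{done})\) tensors (the layer sizes and learning rate below are my own guesses, not tuned values):

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Maps the flattened 81-cell state to one Q-value per action."""
    def __init__(self, n_states=81, n_actions=81, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)

q_net, target_net = QNet(), QNet()
target_net.load_state_dict(q_net.state_dict())  # theta^- starts as a copy of theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_loss(s, a, r, s_next, done):
    """s, s_next: (B, 81) float; a: (B,) long; r, done: (B,) float."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a; theta)
    with torch.no_grad():
        max_next = target_net(s_next).max(dim=1).values      # max_a' Q(s', a'; theta^-)
        target = r + gamma * (1.0 - done) * max_next         # no bootstrap at terminal states
    return nn.functional.mse_loss(q_sa, target)

# Periodically copy theta into theta^- to stabilize training:
# target_net.load_state_dict(q_net.state_dict())
```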