The program will play against another copy of itself. That is, at first you will be playing against a random player, because the copy has not learned anything yet. As you learn, your opponent gets better.
Observe state x. Take action a. (Opponent makes its move.) Observe next state y.
Q(x,a) := r
Q(x,a) := (1-α) Q(x,a) + α r, where α goes 1, 1/2, 1/3, ...
Q(x,a) := (1-α) Q(x,a) + α ( r + γ c ), where c = best Q-value in next state
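
The last update rule can be sketched in Python as below. This is only an illustration under stated assumptions, not the notes' own program: the names Q, visits, best_q and update are made up here, α is taken to be 1/n on the n-th update of a given (x,a) pair, γ = 0.9 is an arbitrary choice, and c is taken to be 0 when the next state has no available actions (game over).

```python
from collections import defaultdict

GAMMA = 0.9  # discount factor γ; the value is an assumption for illustration

Q = defaultdict(float)     # Q[(x, a)]; unseen pairs start at 0
visits = defaultdict(int)  # how many times each (x, a) has been updated


def best_q(y, next_actions):
    """c = best Q-value in next state y (assumed 0 if y has no actions)."""
    return max((Q[(y, b)] for b in next_actions), default=0.0)


def update(x, a, r, y, next_actions, gamma=GAMMA):
    """Apply Q(x,a) := (1-α) Q(x,a) + α (r + γ c), with α = 1, 1/2, 1/3, ..."""
    visits[(x, a)] += 1
    alpha = 1.0 / visits[(x, a)]  # assumed schedule: 1/n per (x, a) pair
    c = best_q(y, next_actions)
    Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * (r + gamma * c)
    return Q[(x, a)]


# Three visits to the same pair, each ending the game (no next actions),
# so c = 0 and the rule reduces to Q(x,a) := (1-α) Q(x,a) + α r.
for r in (1.0, 0.0, 1.0):
    update("x", "a", r, "end", [])
```

With α going 1, 1/2, 1/3, ..., the first update overwrites Q(x,a) with r (the rule Q(x,a) := r), and later updates keep a running average of the targets seen so far. In the loop above, Q("x","a") ends up at the average of the rewards 1, 0, 1.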