How Q-learning works


2.1.6 How Q-learning works

Action a leads from x to y.
Action b leads from y to z.
Action c leads from z to waterhole and reward.

Let all Q-values start at 0.
r > 0
γ > 0

Go from y to z 100 times, but never went from z to waterhole:
Q(y,b) = 0

Eventually, in state z we took action c:
Q(z,c) = r       [2]
At this point this is the only non-zero Q-value in the whole system.

Next time we take action b to go from y to z, we factor in the (best) Q-value of the next state:
Q(y,b) = 0 + γ ( r )       [1]
        = γ r

Next time we go from x to y, this value is propagated back:
Q(x,a) = 0 + γ ( γ r )       [1]
        = γ2 r

If r is only non-zero reward in system, then even 100 steps away from waterhole it will take action heading towards waterhole, since:
γ100 r > 0       [3]
Q-value = 0 + γ 0 + γ2 0 + ... + γ100 r




[1] Simplification 1:


Actually, I simplified above. The learning rate is actually constantly changing.

Here's how it works precisely. We try b repeatedly in y and it leads to r=0 and new state z. As of now, z is of no interest. The updates for Q(y,b) go:

Q(y,b) := 0 (0) + 1 ( 0 + γ 0 ) = 0
Q(y,b) := 1/2 (0) + 1/2 ( 0 + γ 0 ) = 0
Q(y,b) := 2/3 (0) + 1/3 ( 0 + γ 0 ) = 0
...
Q(y,b) = 0 and α(y,b) = 1/n.
Similarly, Q(x,a) = 0 and α(x,a) = 1/m.

Eventually, some time in state z, we try action c and get a reward. We get this reward the very first time we try (z,c).
The updates for Q(z,c) go:       [2]

Q(z,c) := 0 (0) + 1 ( r + γ 0 )
        = r
...
Next time we try (y,b), again we don't get reward, but now it does lead somewhere interesting - z now has a non-zero Q-value:
Q(y,b) := (n-1)/n (0) + 1/n ( 0 + γ r )
        = 1/n γ r
...



Q. How about Q(x,a) now? What is the update the next time we try (x,a)?




Let's carry on.
Next time you update Q(y,b):

Q(y,b) := n/(n+1) (1/n γ r) + 1/(n+1) ( 0 + γ r )
        = 2/(n+1) γ r
and again:
Q(y,b) := (n+1)/(n+2) (2/(n+1) γ r) + 1/(n+2) ( 0 + γ r )
        = 3/(n+2) γ r
with increasing t:
Q(y,b) = t / (t+n-1) γ r
(n-1) constant - the number of times we scored Q(y,b) before we discovered z was good.

As t   -> infinity  
Q(y,b)   ->   γ r

Conclusion - It doesn't matter in the long run that we updated Q(y,b) any finite number of times before learning about z.



Q. Prove Q(x,a)   ->   γ2 r



[2] Simplification 2:

In fact, I still simplified. Q(z,c) will not stay at r (with 0 forever after) unless the waterhole is an absorbing state or an absorbing group of states (an attractor). But if absorbing, how do we get back to y and x?

If not absorbing state, what might Q(z,c) really look like. Consider these two possibilities:

  1. x = geographical state
    Then you can go to waterhole, hop back, get r again, etc.
    Q(z,c) = r + γ 0 + γ2 r + γ3 0 + γ4 r + ...
            = r + γ2 r + γ4 r + ...

  2. x = (geographical state, combined with internal state such as thirst)
    i.e. this is an abstract state-space, not a geographical state-space.
    Then the act of drinking jumps you to a far-away state (in state-space, not in physical space) where you are close geographically but are not thirsty any more. This is one-way - you cannot just jump back to get r again. You may have to go through a large roundabout chain of states (getting thirsty again) to get back to z.
    Q(z,c) = r + 0 + γ100 r + 0 + γ200 r + ...
            = r + γ100 r + γ200 r + ...


One answer to all this:
It does not have to be an absorbing state. Let Q(z,c) go to whatever value, say Q(z,c) = Q. Then:

Q(y,b)   ->   γ Q

Q(x,a)   ->   γ2 Q

etc., as before



[3] Simplification 3:

Actually, rather than:
γ100 r   >   0
it will be:
γ100 r   >   γ102 r
(If you don't move towards the waterhole, you move away, but then you can move back in 1 more step). (*)

(*) Assuming you can move back, i.e. this is two-way, not one-way.

Normally (depending on the MDP problem of course) the good action doesn't have a Q-value massively bigger than the bad actions, only a bit bigger.
Because Q-learning measures you making the wrong choice now and then the right choice afterwards
(rather than measuring you making the wrong choice, followed by the wrong choice, etc.).

Q-value of stupid action = rstupid + γ maxclever Q(y,b)


Stimulus-Response

Q-learning pushes the limits of stimulus-response behaviour. By associating x with y (and y with z and so on) it shows how an apparently "stimulus-response" agent can behave as if it has a past and future.

The whole trend of modern behaviour-based AI is to push the limits of what stimulus-response can do before postulating more complex models (though we all agree that more complex models are needed eventually). See Discussion of Stimulus-Response in PhD.