Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

Search:

CA249      CA318      CA425      CA651

w2mind.computing.dcu.ie      w2mind.org

Missing
DCU student

CASE3 student Paul Bunbury is missing since Thur 2 Feb 2012.
See appeals on crime.ie and garda.ie and facebook.

He is a great coder. See DCU page and boards.ie page.
He won major coding contests in 2010 and 2011.
He is author of the brilliant "FloodItWorld".
DCU can confirm that in Jan 2012 he passed all 6 modules comfortably.


How Q-learning works


2.1.6 How Q-learning works

Action a leads from x to y.
Action b leads from y to z.
Action c leads from z to waterhole and reward.

Let all Q-values start at 0.
r > 0
γ > 0

Go from y to z 100 times, but never went from z to waterhole:
Q(y,b) = 0

Eventually, in state z we took action c:
Q(z,c) = r       [2]
At this point this is the only non-zero Q-value in the whole system.

Next time we take action b to go from y to z, we factor in the (best) Q-value of the next state:
Q(y,b) = 0 + γ ( r )       [1]
        = γ r

Next time we go from x to y, this value is propagated back:
Q(x,a) = 0 + γ ( γ r )       [1]
        = γ2 r

If r is only non-zero reward in system, then even 100 steps away from waterhole it will take action heading towards waterhole, since:
γ100 r > 0       [3]
Q-value = 0 + γ 0 + γ2 0 + ... + γ100 r




[1] Simplification 1:


Actually, I simplified above. The learning rate is actually constantly changing.

Here's how it works precisely. We try b repeatedly in y and it leads to r=0 and new state z. As of now, z is of no interest. The updates for Q(y,b) go:

Q(y,b) := 0 (0) + 1 ( 0 + γ 0 ) = 0
Q(y,b) := 1/2 (0) + 1/2 ( 0 + γ 0 ) = 0
Q(y,b) := 2/3 (0) + 1/3 ( 0 + γ 0 ) = 0
...
Q(y,b) = 0 and α(y,b) = 1/n.
Similarly, Q(x,a) = 0 and α(x,a) = 1/m.

Eventually, some time in state z, we try action c and get a reward. We get this reward the very first time we try (z,c).
The updates for Q(z,c) go:       [2]

Q(z,c) := 0 (0) + 1 ( r + γ 0 )
        = r
...
Next time we try (y,b), again we don't get reward, but now it does lead somewhere interesting - z now has a non-zero Q-value:
Q(y,b) := (n-1)/n (0) + 1/n ( 0 + γ r )
        = 1/n γ r
...



Q. How about Q(x,a) now? What is the update the next time we try (x,a)?




Let's carry on.
Next time you update Q(y,b):

Q(y,b) := n/(n+1) (1/n γ r) + 1/(n+1) ( 0 + γ r )
        = 2/(n+1) γ r
and again:
Q(y,b) := (n+1)/(n+2) (2/(n+1) γ r) + 1/(n+2) ( 0 + γ r )
        = 3/(n+2) γ r
with increasing t:
Q(y,b) = t / (t+n-1) γ r
(n-1) constant - the number of times we scored Q(y,b) before we discovered z was good.

As t   -> infinity  
Q(y,b)   ->   γ r

Conclusion - It doesn't matter in the long run that we updated Q(y,b) any finite number of times before learning about z.



Q. Prove Q(x,a)   ->   γ2 r



[2] Simplification 2:

In fact, I still simplified. Q(z,c) will not stay at r (with 0 forever after) unless the waterhole is an absorbing state or an absorbing group of states (an attractor). But if absorbing, how do we get back to y and x?

If not absorbing state, what might Q(z,c) really look like. Consider these two possibilities:

  1. x = geographical state
    Then you can go to waterhole, hop back, get r again, etc.
    Q(z,c) = r + γ 0 + γ2 r + γ3 0 + γ4 r + ...
            = r + γ2 r + γ4 r + ...

  2. x = (geographical state, combined with internal state such as thirst)
    i.e. this is an abstract state-space, not a geographical state-space.
    Then the act of drinking jumps you to a far-away state (in state-space, not in physical space) where you are close geographically but are not thirsty any more. This is one-way - you cannot just jump back to get r again. You may have to go through a large roundabout chain of states (getting thirsty again) to get back to z.
    Q(z,c) = r + 0 + γ100 r + 0 + γ200 r + ...
            = r + γ100 r + γ200 r + ...


One answer to all this:
It does not have to be an absorbing state. Let Q(z,c) go to whatever value, say Q(z,c) = Q. Then:

Q(y,b)   ->   γ Q

Q(x,a)   ->   γ2 Q

etc., as before



[3] Simplification 3:

Actually, rather than:
γ100 r   >   0
it will be:
γ100 r   >   γ102 r
(If you don't move towards the waterhole, you move away, but then you can move back in 1 more step). (*)

(*) Assuming you can move back, i.e. this is two-way, not one-way.

Normally (depending on the MDP problem of course) the good action doesn't have a Q-value massively bigger than the bad actions, only a bit bigger.
Because Q-learning measures you making the wrong choice now and then the right choice afterwards
(rather than measuring you making the wrong choice, followed by the wrong choice, etc.).

Q-value of stupid action = rstupid + γ maxclever Q(y,b)


Stimulus-Response

Q-learning pushes the limits of stimulus-response behaviour. By associating x with y (and y with z and so on) it shows how an apparently "stimulus-response" agent can behave as if it has a past and future.

The whole trend of modern behaviour-based AI is to push the limits of what stimulus-response can do before postulating more complex models (though we all agree that more complex models are needed eventually). See Discussion of Stimulus-Response in PhD.




Feeds      HumphrysFamilyTree.com

Bookmark and Share           On Internet since 1987.