Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

Search:

CA249      CA318      CA425      CA651

w2mind.computing.dcu.ie      w2mind.org

Missing
DCU student

CASE3 student Paul Bunbury is missing since Thur 2 Feb 2012.
See appeals on crime.ie and garda.ie and facebook.

He is a great coder. See DCU page and boards.ie page.
He won major coding contests in 2010 and 2011.
He is author of the brilliant "FloodItWorld".
DCU can confirm that in Jan 2012 he passed all 6 modules comfortably.


Convergence


2.2.1 Convergence



Notes on α

The learning rate α,
where tex2html_wrap_inline6763,
should take decreasing (with each update) successive values tex2html_wrap_inline6765,
such that tex2html_wrap_inline6767 and tex2html_wrap_inline6769.


Notes:
  1. The harmonic series 1 + 1/2 + 1/3 + ... is an infinite sum.
    Proof: Oresme

  2. The infinite series: 1 + (1/2)p + (1/3)p + ... is a finite sum if p > 1 (and infinite for p < = 1).
    Proof: The sum is equal to:
    1   +   ( (1/2)p+(1/3)p )   +   ( (1/4)p+(1/5)p+(1/6)p+(1/7)p )   + ...
    < 1   +   ( (1/2)p+(1/2)p )   +   ( (1/4)p+(1/4)p+(1/4)p+(1/4)p )   + ...
    = 1   +   2 (1/2)p   +   4 (1/4)p   + ...
    = 1 + r + r2 + r3 + ...
    where r = (1/2)p-1
    p-1 > 0, i.e. r < 1

    
    

    
    
    and the geometric series 1 + r + r2 + r3 + ... with 0 < r < 1 is a finite sum. Proof.


Consider: tex2html_wrap_inline6795 = 1, (1/2)p, (1/3)p, ...
Need: 1 + (1/2)p + (1/3)p + .... to be infinite,
hence: p < = 1
Need: 1 + (1/2)2p + (1/3)2p + .... to be finite,
hence: 2p > 1
hence: tex2html_wrap_inline6797.



Map of α(x,a)

Consider map of α(x,a)

At start, whole map is 1.
After a time, unvisited (x,a) pairs still have α(x,a) = 1.
Visited ones are down to 1/n (which means visited how many times?).
Unvisited states x have α(x,b) = 1 for all b.



Optimal policy found before convergence

Each Q-value Q(x,a) converges to Q*(x,a)
where the optimum action to take in x is the action a* s.t.
Q*(x,a*) = maxb Q*(x,b)

But what normally happens is:
Long before the Q-values finalise, we will have fixated on the best action. Often surprisingly early, we find that:
maxb Q(x,b) = Q(x,a*)
(note Q, not Q*) - that is, the leader now will still be the leader when the Q-values converge - the leader now is actually a* though we don't know it yet.

In practice, the competition rapidly settles down into a contest between 1 or 2 actions.



Need infinite time?

"Infinite no. of samples" seems like a very weak convergence result. But it is only demanded because in a probabilistic world there is always a non-zero probability of any pattern of unlucky samples - e.g. Could get 1000 heads in a row, even though the coin is not only not biased towards heads, but in fact is biased towards tails. In reality, we would be very unlucky not to be operating according to an optimal policy quite quickly.

Q. If coin is biased 75-25 towards tails, what is prob. of 1000 heads in a row?

Q. If you did one 1000-toss experiment a day, how long would you wait to see a single 1000-head event?


p(policy is optimal)

Another way of looking at it:

As we learn, we have a policy. Eventually this policy is optimal. We can definitely say:

As time goes to infinity, p(policy is optimal)   ->   1.

In practice, need to be very unlucky not to have optimal policy after short finite time.

e.g. After short finite time, p(policy is optimal) = 0.99999.


A perverse world

You could deliberately design a difficult world, in which learning the optimal policy took a long time.
Example: Pxa(y) = 0.01, Pyb(z) = 0.01.
No other way of getting to z other than taking a and b from x (even though most times they don't lead to z).
z then has one action c with very high r's, to make up for the low probability of getting them (i.e. the high probability of going to other, worthless states if you take action b in y).
Q(y,b) = 0 + γ E(V(next state))
V(z) is very high, but y,b only leads there with prob. 0.01.
V(state) is very low for the states that y,b leads to with combined prob. 0.99.
(0.99) low + (0.01) high
can still be high if "high" is high enough.
Hence Q(y,b) can be high if the r for c is high enough.

Actions other than c in z are worthless.
Optimal policy is a in x, b in y, c in z.
But will take a long time to learn this. Because it's hard to get to y at all. And even harder to get to z at all. And you have to get to both many times in order to find b and c.


Can you learn optimal policy in finite n steps?

It depends what n is.

Given that a real-world experiment will end (has to end) after n timesteps, for some finite n, it is easy to design a MDP such that it is highly unlikely that you will learn an optimal policy in n steps. Can do this no matter what n is, so long as it is finite.

How do we know your problem world is not like this? You just have to hope your problem world is not like this. You just have to hope your n is big enough.



Convergence proof shows Q-learning automatically adapts if world/problem changes

To "forget" old stuff, reset α = 1.
But in fact you don't have to:
α may start anywhere along the sequence and conditions for convergence satisfied. See Theorem A.2. So can just keep learning and old stuff is eventually wiped.

e.g. Say world changes from MDP1 to MDP2 after time t. Just keep going with Q-learning and will learn optimal policy for MDP2 (eventually) and will forget what it learnt for MDP1 (eventually). No need to change anything.




2.2.1 Convergence of Dynamic Programming and Q-learning


"For all x,a, try a in x"

DP - Don't actually visit any state. Don't actually take any action. Can do it all in your head. What would happen if we took action a in x.

RL - Have to actually be in state x and have to actually execute action a to find out what happens.
Problem - Can't systematically say "for all x,a, go to x, take a" because don't know how to get to x. That's the whole problem.
If we knew how to get to any x, no need to learn anything! Just say "start chess game. go to checkmate" and you're done!

DP doesn't know how to get to x either, but it doesn't have to actually get there. RL does have to actually get to x.


Use DP or RL?

If we know Pxa(y), then DP can construct the optimal policy without interacting with the world at all.

If we don't know Pxa(y), then Q-learning can repeatedly sample the world to construct an optimal policy, if it is a Markov world.

But of course if we don't know Pxa(y), how do we know it's a Markov world!


Paradox

If we know it's an MDP, we don't need Q-learning!

If we don't know it's an MDP, we don't know if Q-learning works!


In practice, no one ever applies Q-learning to an MDP!

Q-learning is only ever applied to unknown worlds, that we hope can be approximated by an MDP (i.e. applying Q-learning will produce an alright policy).




Feeds      HumphrysFamilyTree.com

Bookmark and Share           On Internet since 1987.