do forever observe x suggest a according to some decision-algorithm execute a observe y receive reward r(y) or r(x,y) modify decision-algorithm repeat
a can be "no-op".
r can be 0 every step when nothing happens.
Start in state x:
In state x,
taking action b gives a guaranteed reward of 1.5.
Taking action a normally leads to a reward of only 1,
but sometimes leads to a reward of 5.
Pxa(y) = 0.8
Pxa(z) = 0.2
Pxa(w) = 0 for all w other than y,z
(probabilistic transition)
Pxb(s) = 1
Pxb(w) = 0 for all w other than s
(deterministic transition)
x = ( direction )
Consider:
P ( xt+1=bump |
xt=st, at=bt,
xt-1=st-1, at-1=bt-1,
... )
P ( bump | wall is N, move N, wall is N, move N, wall is N, move N )
is higher than:
P ( bump | wall is N, move N, wall not visible, move N, wall not visible, move N )
i.e. There isn't simply a constant:
P ( bump | wall is N, move N )
Q. What simple change would turn the above into a MDP (in fact, a deterministic MDP)?
Consider:
Observe x Take a (Opponent makes non-deterministic move) Observe y
In the above,
How do we decide which action to take?
Look at expected reward:
Ea(r) = 0.2 (5) + 0.8 (1)
= 1.8
Eb(r) = 1 (1.5)
= 1.5
Action a is better.
What do the probabilities mean?
They mean that if you take a 100 times,
you expect 20 5's and 80 1's.
Total reward 180.
If you take b 100 times
you expect 100 1.5's.
Total reward 150.
Maximise E(r).
You are in state x = (1,5).
Imagine that if you take some action a,
you go to state y with
probabilities:
You can calculate E(r).y Pxa(y) reward (1,8) 0.4 10 (0,8) 0.3 5 (1,5) 0.2 1 (0,5) 0.1 0
But "E(y)" is meaningless.
Consider the sum of the y's times their probabilities
= (1,8) 0.4 + (0,8) 0.3 + (1,5) 0.2 + (0,5) 0.1
= (0.6, 7.1)
You are never in this state when you take a.
In fact, this state doesn't exist.
"0.6" doesn't have a meaning.
The state space is not a continuous space. It is a discrete space of points.
Typically, we don't reward for the correct action, r = r(x,a):
in state x if took action b then reward 1 else reward 0because we don't know the correct action!
in state x if got to state z then reward 1 else reward 0
e.g. In the following:
Taking action a in state x,
P(r=5) = 0.2
P(r=0) = 0.4
P(r=3) = 0.3 + 0.1
= 0.4
Question - What is E(r)?
Reward on transition to good state only: r = r(x,y).
e.g. x = not at waterhole, y = at waterhole.
Problem - may leave state to come back in!
See Discussion.
Reward for arriving and staying in good state: r = r(y).
a a1 a2 a3 Q(x,a) 0 5.1 5Deterministic policy - When in x, always take action a2.
Stochastic policy - When in x, 80 % of time take a2, 20 % of time take a3.
Typically - Start stochastic. Become more deterministic as we learn.
Not just maximise immediate E(r) (i.e. deal with probability).
Maximise long-term sum of expected rewards. This may involve taking a short-term loss for long-term gain (equivalent to allowing a move downhill in hill-climbing).
0 ≤ γ < 1
(e.g. γ = 0.9)
If rewards can be positive or negative, R may be a sum of positive and negative terms.
Special case:
γ = 0
Doesn't work:
γ = 1
(see later)
The interesting situation is in the middle (0 < γ < 1)
r in the future can even override immediate punishment
If we decide what to do based on which path has highest R, then [punishment now, r in the future] can beat [reward now, but not in future] if:
If maximum reward is rmax,
what is maximum possible value of R?
The most it could be would be to get the maximum reward every step:

What is this quantity on the right? Let:
Multiply by γ:
Subtract the 2 equations:
And we get a value for X:
That is:
See Appendix A and Appendix B.
This is just the Negative Binomial Series in the Binomial Theorem.
In the real world, we may not know how to get back to state x
to try another action:
x = "I am at party"
a = "I say chat-up line c1"
result - punishment
I might want to say "Go back to x, try c2" but I may not know how to get back to x. All I can do is try out actions in the states in which I happen to find myself. I have to wait until state x happens to present itself again.
Q-learning can work even when states and actions are experienced haphazardly.
Q-learning is for when you don't know Pxa(y) and Pxa(r).
But one approach is just:
Interact with world large number of times
to build up a table of estimates for
Pxa(y)
and
Pxa(r).
Then just do offline DP.
We keep Q-values Q(x,a) for each (x,a) pair.
We tend to take the action with the highest Q-value:
The Q-value will express the discounted reward above:
Can be built up recursively:
Using this update rule:
the Q-value will bounce back and forth:
Q(x,a) := 5
Q(x,a) := 10
Q(x,a) := 6
...
More sensible to build up an average:
Q(x,a) := 5
Q(x,a) := 1/2 (5 + 10)
Q(x,a) := 1/3 (5 + 10 + 6)
...
See Appendix A for how to build up a running average like this without needing infinite memory.
In general, we use the Notation of updating the running average D "in the direction of" the latest sample d, rather than assigning directly to it:
If d < D then the running average D goes down. If d > D then the running average D goes up.
So the Q-value is updated in the direction of the current sample:
Theorem B.1 - If d bounded, then D bounded.
α must start at 1 to get rid of Dinit
Thereafter α can be anything from 0 to 1.
D := 0 + 1.d (since α starts at 1)
D := -10 d + 0
D := -10 (-10 d) + 0
D := - 103 d
D := 104 d
D := - 105 d
....
D not bounded
i.e. If take this action, you will get immediate reward rmax
and you can get next reward rmax
(it is the best you can get in next state, other actions may be worse)
and you can get next reward rmax ..
....
i.e. If take this action,
you will get immediate reward rmin
and you have to get next reward rmin
(all actions in next state lead to
the worst reward rmin,
otherwise the Q-value would be higher)
and you have to get next reward rmin ..
....
i.e. You are trapped in an attractor
from which you cannot escape.
You cannot ever again get out and go to a state
with rewards greater than rmin (and such states must exist).
Not necessarily a single absorbing state,
but a family of states outside of which you can never leave.
Note if expected reward E(r) = rmin, then you will definitely get rmin on this action (r deterministic, though y can still be probabilistic).
Pxa(r) not very useful.
Say in state x and:
Pxa(r=1) = 0.5,
Pxa(0) = 0.3,
Pxa(-1) = 0.2,
and:
Pxb(3) = 0.5,
Pxb(-3) = 0.5.
Take action a or b?
How do we decide?
Too much info!
We want to know E(r) for action decision.
And this is what Q-value is
(or, even more useful, E(R)).
Exercise - What is E(r) for each action?
Maybe
Pxa(r) is useful
where there is a large standard deviation:
Pxb(100) = 0.5,
Pxb(-100) = 0.5.
Q. What is E(r)?
E(r) doesn't tell full story here. May want to avoid the -100 at all costs.
[Although solution here, if you really must avoid a certain
punishment, is simply make that punishment -1000
(while keeping positive rewards fixed to less than 100).
The machine will soon learn to avoid it.]
In other words, who is to say that -100 is a big punishment?
It looks big, but if most rewards are +1000, then it is actually
a small punishment.
What matters is not the absolute sizes of r's,
but their relative sizes.
A machine with reward function:
"if good thing r = 1 million, if bad thing r = -1 million,
else r = 0"
will perform no different to a machine with reward function:
"if good thing r = 1, if bad thing r = -1,
else r = 0"
Proof: See following.
Q(x,a) =![]()
c + γ c + γ2 c + ..to all Q-values.
See also Appendix C and Appendix D.
if (good event) 10 else 5is the same as (add -5):
if (good event) 5 else 0same as (multiply by 2/5):
if (good event) 2 else 0same as (add 3):
if (good event) 5 else 3
if (good event) r1 else r2is the same as:
if (good event) (r1-r2) else 0same as (multiply by something positive, r3 > r4):
if (good event) (r3-r4) else 0same as:
if (good event) r3 else r4
So any 2-reward fn can be transformed into any other, provided we don't change the order of the rewards.
if (good event) r1 else if (bad event) r2 else r3is the same as:
if (good) (r1-r3) else if (bad) (r2-r3) else 0i.e. (where (r3-r2) positive):
if (good) (r1-r3) else if (bad) - (r3-r2) else 0same as:
if (good) (r4-r6) else if (bad) - (r3-r2)(r4-r6)/(r1-r3) else 0same as:
if (good) r4 else if (bad) r6 - ((r3-r2)(r4-r6)/(r1-r3)) else r6
Can get 1st and last rewards to anything, but then restricted in what middle reward can be.
Can't transform it into arbitrary r5.
(r3-r2)(r4-r6)/(r1-r3) > 0So:
r6 - ((r3-r2)(r4-r6)/(r1-r3)) < r6
And:
r6 < r4
3-rewards - 2 good things - 2 goals - many ways to balance how you pursue both - a few hits of middle r may compensate for losing one high r - may pursue middle r and ignore higher one
To build a model of the "laws of physics"
we need a huge data structure
x,a,y -> [0,1]
in which 99 % of the entries will be zero.
e.g. Let x = Waterhole is N, a = move N,
y = Waterhole is here.
Pxa(x) = 0.5
Pxa(y) = 0.5
Pxa(w) = 0
for all the millions of w other than x,y
Millions of them because the waterhole is actually only
1 dimension of the entire n-dimensional input state:
Pxa(w,*,*,*,...,*) = 0
e.g. Chess.
x = opening configuration of board
a = move white pawn
Opponent then moves.
Observe new state y
Pxa(y) = 0 for billions of states y
(e.g. all check states, all states without full no. of pieces)
In most worlds, for any given x,a pair, the probability of going to millions of y is zero.
As Q-values improve in accuracy, we need to re-visit updates we already did. We previously updated:
The r was accurate.
Q(y,b) was totally inaccurate, however,
so once we learn an accurate value for Q(y,b),
we should really do this update again.
If we do store a world model, then we can re-do updates in our head, factoring in more accurate Q(y,b)'s.
e.g. In waterhole above,
we have explicitly stored the fact that y,b -> z
So as soon as
maxd Q(z,d) increases
(i.e. as soon as we discover action c),
we can reflect this change in Q(y,b),
without having to revisit y.
And we know that x,a -> y
so we can further update Q(x,a)
without having to revisit x.
Need to be careful about computational cost of this backtracking:
There is no limit to this replay: As Q(x,a) gets more accurate, the states that lead to x can be updated with more accurate next-Q-value estimates, and so on. z itself may eventually lead to x in a big loop, so that gets updated, and then x leads eventually to z, so that gets updated again, and so on. We can replay any finite set of experiences forever, updating perhaps all Q-values in the system many times before whole system finally converges.
"Dreaming"? - "Reasoning"? - Explicit memories.
Things you can do offline: Are these perhaps best done when the brain is not otherwise occupied? i.e. Perhaps best done when asleep?
If we don't store a model, we have to wait until y leads to z again before we update Q-values for y (and wait until x leads to y again, etc.) - Implicit memories (Q-values reflect our past experience implicitly).
Model good for changing worlds/problems. Also good if experiences are costly / rare.
Instead of building model (huge data structure, mostly redundant, i.e. mostly 0's) we could just store whatever experiences we had and then replay.
Replay more complex with probabilistic world
- must be fair.
e.g.
x,a -> y
with probability 0.1,
x,a -> z
with prob. 0.9.
Say you just experience
x,a -> y
once, and then replay it 20 times.
- A model may be more trouble than it's worth.
Model is useful for changing world - easier to spot that dynamics of the world have changed if you have a prediction of what they should have been.
Still tricky to detect because MDP allows probabilistic world.
Model useful when give the agent new goals in the same world: "Go to state z". See Feudal Q-learning.
If start at:
α = 1/t
then initial Q-values bias our Q-values for some time.
And since we only run for finite time in any finite experiment,
the bias may still be there after learning.
Consider being "born" with Q-values already filled in (i.e. in DNA) and then start learning:
a1 a2
Q(x,a) 0 0
good Q-values to be born with:
a1 a2
Q(x,a) -1000 0
- even if experiment in childhood,
with a moderate temperature
Boltzmann control policy,
still unlikely to try a1.
1/n ( dt + ... + dn ) is not quite right.
e.g.
1/100 ( d98 + d99 + d100)
But in long run it works.
It equals:
(n-t+1)/n . 1/(n-t+1) ( dt + ... + dn )
-> 1 . E(d)
as n -> infinity