Consider - In state x, choice of 50 actions.
Pick 1, useless.
Eventually, 10,000 timesteps later, we're in x again,
pick another 1 (action), useless.
Eventually, in x again and by chance we pick the future a*.
Normally this gives r with p=0.8.
Unfortunately this is one of the p=0.2 times so it returns 0.
We can't tell difference between it and the other actions.
Even worse, next time we are in x,
we try a worse action b which normally (p=0.9) gives reward 0,
but just this once returns r = 1.
So right now we think b is the best.
We have to try b multiple times in x to realise how good it is,
and we have to try a* multiple times in x to see its probabilities.
Problem - May only ever visit boring, mainstream states -
only small part of the entire
statespace.
Interesting area of statespace:
Hard to visit.
Need skilled sequence of actions to even get into these states.
But high r's.
When can't try all a in all x repeatedly in the finite time available:
Non-random exploration policy to do "smart" exploration in the finite time available. Get focused on a small no. of actions quickly. Get increasingly narrow-minded. Need a heuristic rule: some rule to focus on promising actions and states.
Note however that if we have any policy other than try all a's endlessly in all x's, we may be unlucky and miss the optimal actions. A heuristic is "a rule that may not work" (could be unlucky). In a large state-space we may have no choice and have to take that chance.
Q. What other heuristics could we use?
Instead of random actions, we tend to pick actions that have generated rewards before. As time goes on we only focus on one or two actions in each state, e.g. above we gradually focus on a* and b in state x, and eventually solve that tie-break.
Here the selection is between actions a, which have Q-values Q(x,a). In state x, the probability of picking action a, px(a), is:
Highest Q will always be strictly more probable.
Say Q-values start at zero:
a a1 a2 a3 a4 a5 a6 Q(x,a) 0 0 0 0 0 0After some time, they are as follows (actions on 0 haven't been tried yet):
a a1 a2 a3 a4 a5 a6 Q(x,a) 0 3.1 -3.3 0 -3.5 3.3Now a high temperature would still pick actions randomly:
a a1 a2 a3 a4 a5 a6 p(a) 1/6 1/6 1/6 1/6 1/6 1/6
A moderate temperature would pick actions with probability something like:
a a1 a2 a3 a4 a5 a6 p(a) 0.2 0.3 0 0.2 0 0.3A low temperature would pick actions with probability:
a a1 a2 a3 a4 a5 a6 p(a) 0 0.1 0 0 0 0.9
A good question is how does our policy deal with actions that haven't been tried yet. Q-values may start at any values. 0 is not necessarily neutral. 0 isn't necessarily small (or large). It might be a high or low Q-value.
(*) If we have time.
If not, we could finish after finite time
with the best actions being all
actions we have never tried!
i.e. After learning, our policy of pick the best action
yields a more-or-less random policy!
Note we do know that the action hasn't been tried, because we are keeping a different α for each pair (x,a) (see here).
Be narrow-minded about a's we have tried, random about a's we haven't tried?
then 0 is a neutral reward, but that doesn't mean it's a neutral Q-value.if (good thing) r = 1 else if (bad thing) r = -1 else r = 0
Stochastic control policy - Still have non-zero probabilities at run-time. - "Animals" - Usually, but not totally predictable - Don't "crash" or get stuck in infinite loop - Will eventually do something different
Experiment:
Wasp (or was it spider?) with prey checking tunnel
(Can anyone find the original reference?):
"A famous Frenchman once studied the life cycle and behavior of a wasp species. After mating, this species digs a tunnel in sandy soils, flies away, catches a prey caterpillar and anesthetizes it, carries it back to the tunnel, puts it near the opening, goes into the tunnel to "check" that everything is OK, comes out, drags the caterpillar down to the end of the tunnel, injects an egg into the caterpillar, then flies away, never to return. From the egg a wasp larva hatches, gradually eats up the caterpillar, turns in due course into a cocoon, metamorphoses, and emerges as a full-fledged sand wasp. If it's a female, she now flies away and either hibernates over the winter or immediately repeats the previous cycle of mating, digging tunnels, catching caterpillars . . . Common sense suggests that this behavior is not learned. These wasps mature alone in the tunnel; everything they do seems to be accomplished with no one's instructions. Like the shape of their wings, or like the color of your eyes, their behavior is genetically determined.
This conclusion receives further confirmation from experimental interventions. While the wasp was going in to "check" the tunnel, the naughty Frenchman removed the caterpillar a short distance away. Emerging from the tunnel, the wasp brought the caterpillar back to the same place, then proceeded to go down the tunnel once more. The Frenchman removed it again, and the wasp, upon re-emerging from the tunnel, brought it back, once more going to "check" the tunnel. This process continued many times, until the wasp finally gave up and took the caterpillar straight in, or until our naturalist gave up and let the caterpillar be (I forgot which). All this is reminiscent of a broken computer loop; you short-circuit the loop and get stereotypic, maladjusted behavior."
Admittedly, animals not getting stuck in infinite loop is also to do with:
world changes (no absorbing states),
internal state changes (hunger),
ongoing competition among multiple drives,
etc.
I say in PhD that even when finished learning: "instead of using a deterministic strict highest Q I used a low temperature Boltzmann". - I am assuming I do not have an optimal policy (e.g. I have only approximated Q, perhaps not MDP anyway).
Makes no sense to do this if I have an optimal policy. - Then should use strict highest Q.