Mark Humphrys - Research - Action Selection - W-learning movie


Movie demo of W-learning in the Ant World problem



Introduction

A full description of the Ant World problem is given on a separate page. Briefly, the creature exists in a toroidal gridworld, populated by static pieces of food and a randomly moving predator. When the creature encounters food, it picks it up. It may carry only 1 piece at a time, and it drops the food at the nest. When a piece of food is picked up, another one grows in a random location. We consider the following Q-learning agents:

agent Af generates rewards for itself:
  if (just picked up food) reward = 0.7
  else reward = 0

agent An generates rewards for itself:
  if (just arrived at nest) reward = 0.1
  else reward = 0

agent Ap generates rewards for itself:
  if (just shook off predator (no longer visible)) reward = 0.5
  else reward = 0
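
As a rough illustration (a sketch, not the program used for these movies), the three reward rules above could be written in Python as follows. The function and parameter names are hypothetical; only the reward values come from the text.

```python
# Sketch of the three reward rules. The boolean arguments are assumed to be
# computed elsewhere from what happened on the last timestep.

def reward_af(just_picked_up_food: bool) -> float:
    """Food-seeking agent Af: rewarded only when food was just picked up."""
    return 0.7 if just_picked_up_food else 0.0


def reward_an(just_arrived_at_nest: bool) -> float:
    """Nest-seeking agent An: rewarded only when it just arrived at the nest."""
    return 0.1 if just_arrived_at_nest else 0.0


def reward_ap(just_shook_off_predator: bool) -> float:
    """Predator-avoiding agent Ap: rewarded when the predator just went out of sight."""
    return 0.5 if just_shook_off_predator else 0.0
```

Each agent sees only its own reward, which is why each develops its own "obsession" when run alone.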

Each timestep, the creature senses a state x, and each agent inside the creature suggests an action (there may be only one agent inside the creature). Some agent Ak wins the internal competition and has its action a executed, and the creature then senses a new state y. The caption line of each movie shows this step:

 x [Ak] a -> y 
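
The timestep just described might be sketched in Python like this. It is an illustration only: `step`, `policies`, `choose_winner` and `execute` are hypothetical names, and how the winner is chosen is deliberately left as a pluggable function.

```python
from typing import Callable, Dict, Hashable, Tuple

State = Hashable
Action = str


def step(
    x: State,
    policies: Dict[str, Callable[[State], Action]],
    choose_winner: Callable[[State, Dict[str, Action]], str],
    execute: Callable[[State, Action], State],
) -> Tuple[str, Action, State, str]:
    """Run one timestep and return (winner, action, new state, caption)."""
    # Every agent inside the creature suggests an action for state x.
    suggestions = {name: policy(x) for name, policy in policies.items()}
    # Some agent Ak wins the internal competition ...
    k = choose_winner(x, suggestions)
    # ... and has its action a executed, giving a new state y.
    a = suggestions[k]
    y = execute(x, a)
    # Caption in the same layout as the movies: x [Ak] a -> y
    caption = f"{x} [{k}] {a} -> {y}"
    return k, a, y, caption
```

With a single agent in `policies`, `choose_winner` trivially returns that agent, which corresponds to the single-agent movies.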


The 3 agents

First we watch the creature completely under the control of agent Af:



Af movie, 100 steps.


Af senses the direction of visible food within a small radius (including a value for "none visible"). By Q-learning, Af builds up these Q-values. These values mean that it learns to seek out food when the creature is not carrying any, but once the creature is carrying food, Af is at a loss as to what to do. The only way it can gain any future rewards is to lose the piece of food at the nest, but it cannot learn how to do this because it does not sense the nest. So it just wanders about. If it should accidentally wander into the nest and lose its food, it immediately sets off in search of more, and once successful, will be aimless again. And so on. It completely ignores the predator.
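
For readers unfamiliar with how an agent "builds up Q-values", here is a minimal sketch of the standard tabular Q-learning update. The learning rate `alpha`, discount `gamma`, the state and action labels, and the zero initial values are assumptions for illustration, not the settings used in these experiments.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, str], float]


def q_update(
    Q: QTable,
    x: Hashable,
    a: str,
    r: float,
    y: Hashable,
    actions: Sequence[str],
    alpha: float = 0.1,   # assumed learning rate
    gamma: float = 0.9,   # assumed discount factor
) -> float:
    """Apply one Q-learning update after observing (x, a, r, y)."""
    # Best value the agent believes it can get from the new state y.
    best_next = max(Q[(y, b)] for b in actions)
    # Move Q(x, a) part of the way toward r + gamma * max_b Q(y, b).
    Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * (r + gamma * best_next)
    return Q[(x, a)]


# Usage: Af sees food to the north, moves north, and picks it up (reward 0.7).
Q: QTable = defaultdict(float)
new_value = q_update(Q, "food_N", "N", 0.7, "none_visible", ["N", "S", "E", "W"], alpha=0.5)
```

Because Af's state does not include the nest, no sequence of such updates can teach it where to drop the food.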

Next we watch the creature under the control of agent An:



An movie, 100 steps.


An senses the direction of the nest within a small radius. By Q-learning, An builds up these Q-values. If the nest is not visible, An wanders randomly. Once it is visible, An heads straight to it and then, instead of staying put, learns to jump out and back in so it can get that "just arrived at nest" reward again and again! It is happy maximising its rewards, ignoring both food and predators.

Then we watch the creature under the control of agent Ap:



Ap movie, 100 steps.


Ap senses the direction of the predator. By Q-learning, Ap builds up these Q-values. If the predator is visible, Ap learns to move away from it in the broad opposite diagonal direction. When the predator has gone out of sight, Ap doesn't actually stay put, but wanders randomly in the hope that the predator comes back into sight so it can get the "just shook off predator" reward again! It almost looks as if it is baiting the predator - repeatedly coming near it and then withdrawing. It ignores food.


The 3 agents together

So we have 3 agents, each with rather obsessive ideas about what the creature should do. We put all three into a single creature, and have them compete through W-learning for the right to control it. All three agents are going to end up somewhat frustrated.

By W-learning, the competing agents build up these W-values. These values mean that Af is generally obeyed if the creature is not carrying food, sometimes with competition from Ap when a predator is visible. If the creature is carrying food, Af has no strong opinions about what to do, and Ap is free to dominate if a predator is visible. If no predator is visible, then Ap has no strong opinions either (apart from not wanting to stay still) and the weak but constant signalling of An is finally audible. The result is a predator-avoiding, food-foraging creature in which, at every timestep, 2 of the agents are not being listened to.
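
The competition described above can be pictured with a small sketch. It assumes, as is my understanding of W-learning rather than something spelled out on this page, that the agent with the highest W-value for the current state is the one obeyed. The table layout, state label, and numbers below are purely illustrative.

```python
from typing import Dict, Hashable


def choose_winner(W: Dict[str, Dict[Hashable, float]], x: Hashable) -> str:
    """Return the name of the agent with the highest W-value in state x.

    W maps agent name -> {state: W-value}; missing states count as 0.0.
    Ties go to whichever agent appears first in W.
    """
    return max(W, key=lambda name: W[name].get(x, 0.0))


# Illustrative values only: carrying food, no predator visible.
W = {
    "Af": {"carrying, no predator": 0.00},
    "An": {"carrying, no predator": 0.05},  # weak but constant signal
    "Ap": {"carrying, no predator": 0.01},
}
winner = choose_winner(W, "carrying, no predator")
```

In this made-up state An wins only because the other two agents care even less, which matches the "finally audible" description in the text.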

We watch the creature under the control of the 3 competing agents Af, An, Ap:



3 agents movie, 300 steps.


Note in the caption line how control switches from agent to agent. One thing that helps the agents live together successfully is that they are all restless agents. Not one of them ever wants to stay still, no matter what is happening. This makes it easy for another agent to suggest a movement somewhere. We can draw a map of the statespace showing how control is divided up.


Summary

So, to summarise, the agents start out with random Q-values and W-values, hence the creature starts out with random behavior:



3 agents "before" movie.


By Q-learning, rewards are propagated into Q-values, and by W-learning, the differences between Q-values are propagated into W-values, until the creature finally settles down into a steady pattern of behavior:



3 agents "after" movie.
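
The summary above says that W-learning propagates differences between Q-values into W-values. One way to read that, stated here as an assumption rather than taken from this page, is that each agent that was not obeyed compares what it predicted for its own suggested action, Q_i(x, a_i), with what it actually received, r_i + gamma * max_b Q_i(y, b), and moves W_i(x) toward that difference. A minimal sketch, with hypothetical names and parameter values:

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple


def w_update(
    W_i: DefaultDict[Hashable, float],
    Q_i: DefaultDict[Tuple[Hashable, str], float],
    x: Hashable,
    a_i: str,          # the action agent i suggested (but did not get)
    r_i: float,        # the reward agent i actually received
    y: Hashable,       # the state that actually followed
    actions: Sequence[str],
    alpha: float = 0.1,  # assumed learning rate
    gamma: float = 0.9,  # assumed discount factor
) -> float:
    """Update W_i(x) for an agent i that was NOT obeyed on this timestep."""
    predicted = Q_i[(x, a_i)]
    received = r_i + gamma * max(Q_i[(y, b)] for b in actions)
    # A large gap means agent i lost a lot by not being obeyed in state x.
    W_i[x] = (1 - alpha) * W_i[x] + alpha * (predicted - received)
    return W_i[x]
```

An agent that loses little when ignored ends up with a low W-value in that state and stops contesting it, which is how control comes to be divided up across the statespace.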







How I made these Movies

My program actually has no user interface at all. During a run, it builds these gnuplot data files. After the run, if I want to see what happened, I tell gnuplot to plot the data files. This actually causes gnuplot to play an animated sequence of images.

To bundle this animation into an MPEG file, I get gnuplot to dump each plot into its own pbm file. The pbm files can then be strung together frame-by-frame into an MPEG.
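
As a hedged sketch of the frame-dumping step, the Python function below generates a gnuplot script that writes one pbm file per data file. The data file names, the output prefix, and the `with points` plot style are hypothetical, and it assumes a gnuplot build that includes the `pbm` terminal. Stringing the frames into an MPEG is left to an external encoder and is not shown.

```python
from typing import Sequence


def gnuplot_frame_script(data_files: Sequence[str], prefix: str = "frame") -> str:
    """Build a gnuplot script that dumps each data file to its own pbm frame."""
    lines = ["set terminal pbm color"]
    for i, data in enumerate(data_files):
        # One output file per frame, numbered so they sort in playback order.
        lines.append(f"set output '{prefix}{i:04d}.pbm'")
        lines.append(f"plot '{data}' with points notitle")
    # A bare 'set output' closes the last file.
    lines.append("set output")
    return "\n".join(lines) + "\n"


# Usage: two hypothetical per-timestep data files.
script = gnuplot_frame_script(["step0.dat", "step1.dat"])
```

The resulting text can be saved to a file and passed to gnuplot on the command line.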




These movies are also available on a "video appendix" deposited with the 1996 version (PhD 20843) of my PhD thesis in the Manuscripts Room of Cambridge University Library. This VHS video tape plays the 4 Movies above in sequence. First, the creature under the control of agent Af alone. Then An alone. Then Ap alone. Then all 3 competing together in the same body.



