All we need do is start them off randomly different to get this process going.
Note from the graph of the sigmoid function that where the summed input x is large (positive or negative), the slope dy/dx is very small.
dy/dx = y(1-y), and at either end, one of these terms is near zero.
Hence for large absolute xk, the weight gradient ∂E/∂wjk is near zero, and the threshold gradient ∂E/∂tk is near zero too.
A large absolute summed x (caused by large absolute weights) gives only a small change in the weights: very slow learning.
Large absolute weights cause slow learning.
Initialise with small absolute weights (fast learning).
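The flattening of the slope can be checked numerically. This is a minimal sketch using only the relation dy/dx = y(1-y) from above; the function names and the sample x values are chosen here for illustration.

```python
import math

def sigmoid(x):
    """Standard sigmoid y = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(x):
    """Slope dy/dx = y(1-y) at summed input x."""
    y = sigmoid(x)
    return y * (1.0 - y)

# The slope peaks at x = 0 and shrinks quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0, -10.0):
    print(f"x = {x:6.1f}   y = {sigmoid(x):.6f}   dy/dx = {sigmoid_slope(x):.6f}")
```

Every weight and threshold gradient into a node is multiplied by this slope, so a node sitting far out on either tail learns very slowly.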
Question - How small is "small"?
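Obviously zero would be small enough. But consider what happens if every weight and threshold starts at exactly zero.
Forward pass: every summed input x is 0, so every node outputs y = 1/2.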
Backprop, output layer:
∂E/∂tk = ek (1/2)(1/2)(-1)
The ek differ from one output node to another, so the tk's are changed by different amounts and become all different.
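Backprop, previous layer:
∂E/∂tj = c(-1)
where c is the same for every hidden node j. The tj's all start the same and are all changed by the same amount, so they stay the same.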
Hidden nodes are all the same, stay the same:
We have a symmetrical network. The hidden units march locked in step. Each node stays identical to the others in the hidden layer. Hidden units don't specialise. The net can't work (as we saw when designing them). No point having n hidden units if they're all the same. You might as well only have 1 hidden unit.
This makes sense. How could the network pick a hidden node to specialise on some part of the problem? Surely whichever node it picked, it would make the same changes for the other nodes too.
Output nodes are different.
But hidden units are all the same, and stay the same.
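The lock-step behaviour can be reproduced in a few lines. This is a sketch under assumptions not stated above: a squared error E = 1/2 Σ (yk - Ok)^2, summed input = Σ (weight × input) minus threshold, and plain gradient descent with a fixed rate. The network size, exemplars, rate and all names are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_zero_init(exemplars, n_in, n_hidden, n_out, rate=0.5, epochs=200):
    # Every weight and threshold starts at exactly zero.
    w_ij = [[0.0] * n_hidden for _ in range(n_in)]   # input i -> hidden j
    t_j = [0.0] * n_hidden                           # hidden thresholds
    w_jk = [[0.0] * n_out for _ in range(n_hidden)]  # hidden j -> output k
    t_k = [0.0] * n_out                              # output thresholds
    for _ in range(epochs):
        for inputs, targets in exemplars:
            # Forward pass: summed input minus threshold, then sigmoid.
            y_j = [sigmoid(sum(inputs[i] * w_ij[i][j] for i in range(n_in)) - t_j[j])
                   for j in range(n_hidden)]
            y_k = [sigmoid(sum(y_j[j] * w_jk[j][k] for j in range(n_hidden)) - t_k[k])
                   for k in range(n_out)]
            # dE/dx at each output node (assumed squared error).
            d_k = [(y_k[k] - targets[k]) * y_k[k] * (1 - y_k[k]) for k in range(n_out)]
            # dE/dx at each hidden node, using the weights before this update.
            d_j = [sum(d_k[k] * w_jk[j][k] for k in range(n_out)) * y_j[j] * (1 - y_j[j])
                   for j in range(n_hidden)]
            # Gradient descent; thresholds carry the extra factor (-1).
            for j in range(n_hidden):
                for k in range(n_out):
                    w_jk[j][k] -= rate * d_k[k] * y_j[j]
            for k in range(n_out):
                t_k[k] -= rate * d_k[k] * (-1)
            for i in range(n_in):
                for j in range(n_hidden):
                    w_ij[i][j] -= rate * d_j[j] * inputs[i]
            for j in range(n_hidden):
                t_j[j] -= rate * d_j[j] * (-1)
    return w_ij, t_j, w_jk, t_k

# Hypothetical exemplars: 2 inputs, 2 target outputs each.
exemplars = [((0.0, 1.0), (0.9, 0.1)),
             ((1.0, 0.0), (0.1, 0.1)),
             ((1.0, 1.0), (0.9, 0.1))]
w_ij, t_j, w_jk, t_k = train_zero_init(exemplars, n_in=2, n_hidden=3, n_out=2)
print("hidden thresholds:", t_j)   # one value repeated for every hidden node
print("output thresholds:", t_k)   # output nodes have moved apart
```

Replacing the zeros with small random values (e.g. `random.uniform(-0.1, 0.1)`) is enough to break the symmetry in this sketch.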
Also, exemplar outputs in training should be 0.1 to 0.9, rather than 0 to 1, or else very large weights develop. The net can't actually output 0 or 1 without at least one weight going to plus or minus infinity, which causes problems for other exemplars.
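To see why targets of exactly 0 or 1 are a problem, one can invert the sigmoid and ask what summed input a given target needs. A small sketch; the function name is made up here, and the standard sigmoid y = 1/(1 + e^-x) is assumed.

```python
import math

def required_summed_input(target):
    """Summed input x for which sigmoid(x) equals target (the inverse sigmoid)."""
    return math.log(target / (1.0 - target))

# The required x is modest for 0.9 but grows without bound as the target nears 1.
for target in (0.9, 0.99, 0.999999):
    print(f"target {target}: x = {required_summed_input(target):.3f}")
```

For a target of exactly 1 the expression divides by zero: no finite summed input, and so no finite set of weights, produces it.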
Network can start with random values and learn to get rid of these.
But of course that means it can learn to get rid of good values over time as well. It can't tell the difference.
If it doesn't see an exemplar for a while, it will forget it. For all it knows, it has just started learning, and the weights it has now are just a random initialisation! It keeps learning, wiping out anything too far in the past.
Learning = Forgetting!
e.g. Extreme case - We show it one exemplar repeatedly: "Input x leads to Output 1", 1 million times in a row. The "laziest" way for the network to represent this is to just send the weights towards infinity (or minus infinity where the Input is negative), so Output = 1 no matter what the Input. i.e. Instead of "x -> 1" it learns "* -> 1".
If we show it "x -> 1" a million times, then all weights may be recruited to help "x -> 1". Normally, showing it "x -> 1" does have an effect on all weights, but this effect is countered by the effects of other exemplars. The way the net resolves this tension is by specialisation, where some weights are more-or-less irrelevant in some areas of the input space. Since they have little effect on the error there, the backprop algorithm ensures they are hardly modified. (If outputs are continuous, the effect is never exactly zero, however tiny.) Then when we show it "x -> 1" once, it does have an effect on such a weight, but the effect is negligible.
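The extreme case of a single repeated exemplar can be sketched with one sigmoid unit. Assumptions for illustration only: one input, one weight w and one threshold t, squared error, a fixed learning rate, and the exemplar "input 1.0 -> output 1".

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_on_one_exemplar(steps, rate=1.0, x=1.0, target=1.0):
    """Show the same exemplar `steps` times to a single sigmoid unit."""
    w, t = 0.0, 0.0
    for _ in range(steps):
        y = sigmoid(w * x - t)
        d = (y - target) * y * (1.0 - y)   # dE/d(summed input), squared error
        w -= rate * d * x
        t -= rate * d * (-1)               # threshold carries the factor (-1)
    return w, t

for steps in (100, 10_000, 100_000):
    w, t = train_on_one_exemplar(steps)
    unseen = sigmoid(w * 0.0 - t)          # an input it was never shown
    print(f"{steps:7d} steps: w = {w:.3f}, t = {t:.3f}, output for input 0.0 = {unseen:.3f}")
```

The weight never settles: each step still pushes it a little further, and the output for the unseen input is dragged towards 1 along with it.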
How does the process of specialising work? - As the net learns, it finds that for each weight, the weight has more effect on E for some exemplars than others. It is modified more as a result of the backprop from those exemplars, making it even more influential on them in the future, and making the backprop from other exemplars progressively less important.
First, exemplars should have a broad spread. Show it "x -> y" alright, but if you want it to learn that some things do not lead to y you must show it explicitly that "(NOT x) -> (NOT y)". e.g. In learning behaviour, some actions lead to good things happening, some to bad things, but most actions lead to nothing happening. If we only show it exemplars where something happened, it will predict that everything leads to something happening, good or bad. We must show it the "noise" as well.
How do we make sure it learns and doesn't forget? - If exemplars come from a training set, we just make sure we keep re-showing it old exemplars. But exemplars may come from the world, where we cannot choose what is shown again. So it forgets old and rare experiences.
One solution is to have learning rate C decline over time. But then it doesn't learn new experiences.
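As an illustration of the trade-off, here is one possible way to let the learning rate decline. The schedule (rate halves after a chosen number of steps, then keeps falling) and the names are assumptions made here, not a schedule given in these notes.

```python
def decayed_rate(initial_rate, step, half_life_steps):
    """Learning rate C after `step` updates; equals half the initial rate at half_life_steps."""
    return initial_rate / (1.0 + step / half_life_steps)

# Early exemplars move the weights a lot, late ones hardly at all.
for step in (0, 1_000, 100_000, 10_000_000):
    print(step, decayed_rate(0.5, step, half_life_steps=1_000))
```

With a rate this small late in training, old knowledge is protected, but a genuinely new experience changes the weights by almost nothing.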
A possible solution, if exemplars come from the world, is an internal memory and replay of old experiences. This does not mean remembering exemplars as a lookup table, as we had before, but remembering them in order to push them through the network repeatedly. But the same question arises as before - Does the list grow forever?
See PhD for a strategy of remembering exemplars and replaying without needing infinite memory.
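For illustration only, one simple way to bound the memory is a fixed-size buffer that drops the oldest exemplar when full and replays a random sample. This is a generic sketch, not the strategy from the PhD referred to above; the class name, capacity and uniform sampling are all assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (inputs, target) exemplars for replay."""

    def __init__(self, capacity):
        # deque with maxlen silently discards the oldest item when full.
        self.items = deque(maxlen=capacity)

    def remember(self, inputs, target):
        self.items.append((inputs, target))

    def sample(self, n, rng=random):
        """Return up to n stored exemplars, chosen uniformly without replacement."""
        n = min(n, len(self.items))
        return rng.sample(list(self.items), n)

memory = ReplayMemory(capacity=3)
for i in range(5):
    memory.remember((float(i),), 0.9)
print(list(memory.items))   # only the 3 most recent exemplars remain
print(memory.sample(2))     # exemplars to push through the network again
```

The cost of bounding the list this way is that anything older than the buffer is still forgotten, so it eases the problem rather than removing it.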