We can now look at more sophisticated ANNs, known as multi-layer artificial neural networks because they have hidden layers. These are naturally used to undertake more complicated tasks than perceptrons can handle. We first look at the network structure for multi-layer ANNs, and then in detail at the way in which the weights in such structures can be determined to solve machine learning problems. There are many considerations involved with learning such ANNs, and we consider some of them here. First and foremost, the learning algorithm can get stuck in local minima, and there are some ways to try to get around this. As with any learning technique, we will also consider the problem of overfitting, and discuss which types of problems an ANN approach is suitable for.
Multi-Layer Network Architectures
We saw in the previous lecture that perceptrons have limited scope in the type of concepts they can learn - they can only learn linearly separable functions. However, we can think of constructing larger networks by building them out of perceptron-like units. In such larger, multi-layer networks, the units which use the step function are called perceptron units.
As with individual perceptrons, multi-layer networks can be used for learning tasks. However, the learning algorithm that we look at (the backpropagation routine) is derived mathematically, using differential calculus. The derivation relies on having a differentiable threshold function, which effectively rules out using perceptron units if we want to be sure that backpropagation works correctly: the step function in perceptrons is not continuous, and hence not differentiable. An alternative unit was therefore chosen which has similar properties to the step function in perceptron units, but which is differentiable. There are many possibilities, one of which is the sigmoid unit, as described below.
Sigmoid units
Remember that the function inside a unit takes as input the weighted sum, S, of the values coming from the units connected to it. The function inside a sigmoid unit calculates the following value, given a real-valued input S:

σ(S) = 1/(1 + e^-S)

where e is the base of natural logarithms, e = 2.718...
When we plot the output from a sigmoid unit for various weighted sums given as input, the curve looks remarkably like a smoothed-out version of the step function.
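As a minimal sketch (Python is used here for illustration; the function names are our own, not part of the lecture notes), the sigmoid and its derivative can be computed as follows. The derivative, σ(S)(1 - σ(S)), is what makes sigmoid units suitable for the calculus-based derivation of backpropagation mentioned above.

import math

def sigmoid(s):
    # Sigmoid activation: maps any real-valued weighted sum s into (0, 1).
    return 1.0 / (1.0 + math.exp(-s))

def sigmoid_derivative(s):
    # Derivative of the sigmoid, written in terms of the sigmoid itself.
    sig = sigmoid(s)
    return sig * (1.0 - sig)

# Large positive sums give outputs near 1, large negative sums near 0,
# and the transition around s = 0 is smooth rather than a hard step.
print(sigmoid(7.0), sigmoid(-5.0))   # approximately 0.999 and 0.0067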
Example Multi-layer ANN with Sigmoid Units
We will concern ourselves here with ANNs containing only one hidden layer, as this makes describing the backpropagation routine easier. Note that networks where you can feed in the input on the left and propagate it forward to get an output are called feed forward networks. Below is such an ANN, with two sigmoid units in the hidden layer. The weights have been set arbitrarily between all the units.
Note that the sigmoid units have been identified with sigma (σ) signs in the nodes on the graph. As we did with perceptrons, we can give this network an input and determine the output. We can also look to see which units "fired", i.e., had a value closer to 1 than to 0.
Suppose we input the values 10, 30, 20 into the three input units, from top to bottom. Then the weighted sum coming into H1 will be:
S_H1 = (0.2 * 10) + (-0.1 * 30) + (0.4 * 20) = 2 - 3 + 8 = 7. Then the σ function is applied to S_H1 to give:
σ(S_H1) = 1/(1 + e^-7) = 1/(1 + 0.000912) = 0.999
[Don't forget to negate S]. Similarly, the weighted sum coming into H2 will be:
S_H2 = (0.7 * 10) + (-1.2 * 30) + (1.2 * 20) = 7 - 36 + 24 = -5 and σ applied to S_H2 gives:
σ(S_H2) = 1/(1 + e^5) = 1/(1 + 148.4) = 0.0067
From this, we can see that H1 has fired, but H2 has not. We can now calculate that the weighted sum going into output unit O1 will be:
S_O1 = (1.1 * 0.999) + (0.1 * 0.0067) = 1.0996
and the weighted sum going into output unit O2 will be: S_O2 = (3.1 * 0.999) + (1.17 * 0.0067) = 3.1047
The sigmoid unit in output node O1 will now calculate the output value from the network for O1:
σ(S_O1) = 1/(1 + e^-1.0996) = 1/(1 + 0.333) = 0.750
and the output from the network for O2:
σ(S_O2) = 1/(1 + e^-3.1047) = 1/(1 + 0.045) = 0.957
Therefore, if this network represented the learned rules for a categorisation problem, the input triple (10,30,20) would be categorised into the category associated with O2, because this has the larger output.
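The forward pass above can be reproduced with a short sketch. The weight values and layer layout are taken from the worked example; the helper name forward and the list-based representation are our own choices, not something prescribed by the lecture notes.

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Weights from the worked example: each row holds the weights into one
# hidden unit (H1, H2) from the three input units, top to bottom.
input_to_hidden = [[0.2, -0.1, 0.4],    # weights into H1
                   [0.7, -1.2, 1.2]]    # weights into H2

# Each row holds the weights into one output unit (O1, O2) from H1 and H2.
hidden_to_output = [[1.1, 0.1],         # weights into O1
                    [3.1, 1.17]]        # weights into O2

def forward(inputs):
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)))
              for row in input_to_hidden]
    outputs = [sigmoid(sum(w * h for w, h in zip(row, hidden)))
               for row in hidden_to_output]
    return hidden, outputs

hidden, outputs = forward([10, 30, 20])
print(hidden)    # approximately [0.999, 0.0067] -> H1 fires, H2 does not
print(outputs)   # approximately [0.750, 0.957]  -> O2 gives the larger output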
Backpropagation Learning Routine
As with perceptrons, the information in the network is stored in the weights, so the learning problem comes down to this question: how do we train the weights to best categorise the training examples? We then hope that this representation provides a good way to categorise unseen examples.
In outline, the backpropagation method is the same as for perceptrons:
1. We choose and fix our architecture for the network, which will contain input, hidden and output units; the hidden and output units will contain sigmoid functions.
2. We randomly assign the weights between all the nodes. The assignments should be small numbers, usually between -0.5 and 0.5.
3. Each training example is used, one after another, to re-train the weights in the network. The way this is done is given in detail below.
4. After each epoch (a run through all the training examples), a termination condition is checked (also detailed below). Note that, for this method, we are not guaranteed to find weights which give the network the global minimum error, i.e., perfectly correct categorisation of the training examples. Hence the termination condition may have to be expressed in terms of a (possibly small) number of mis-categorisations. We see later that this might not be such a good idea, though. A sketch of this overall training loop is given after the list.
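Here is a minimal sketch of that outer loop, under stated assumptions: update_weights stands in for the per-example weight training described in the next section, and count_errors for the mis-categorisation count; both names are placeholders rather than anything defined in the lecture notes.

import random

def initialise_weights(rows, cols):
    # Small random starting weights, usually between -0.5 and 0.5.
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def train(examples, weights, update_weights, count_errors,
          max_epochs=1000, allowed_errors=0):
    # update_weights(weights, inputs, targets) re-trains the weights for one
    # example; count_errors(weights, examples) counts mis-categorisations.
    for epoch in range(max_epochs):
        for inputs, targets in examples:
            update_weights(weights, inputs, targets)
        # Termination condition, checked after each epoch.
        if count_errors(weights, examples) <= allowed_errors:
            break
    return weights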
Weight Training Calculations
Because we have more weights in our network than in perceptrons, we first need to introduce the notation w_ij to specify the weight between unit i and unit j. As with perceptrons, we will calculate a value Δ_ij to add on to each weight in the network after an example has been tried. To calculate the weight changes for a particular example, E, we first start with the information about how the network should perform for E. That is, we write down the target values t_i(E) that each output unit O_i should produce for E. Note that, for categorisation problems, t_i(E) will be zero for all the output units except one, which is the unit associated with the correct categorisation for E. For that unit, t_i(E) will be 1.
Next, example E is propagated through the network so that we can record all the observed values o_i(E) for the output nodes O_i. At the same time, we record all the observed values h_i(E) for the hidden nodes. Then, for each output unit O_k, we calculate its error term as follows:

δ_Ok = o_k(E)(1 - o_k(E))(t_k(E) - o_k(E))
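As a small numerical illustration using the outputs from the worked example above, and assuming (purely for illustration) that the correct categorisation for the input (10, 30, 20) is the one associated with O2, so that the targets are t_1(E) = 0 and t_2(E) = 1, the output error terms would be computed like this:

# Error term for a sigmoid output unit O_k:
#   delta_Ok = o_k(E) * (1 - o_k(E)) * (t_k(E) - o_k(E))
def output_error(observed, target):
    return observed * (1 - observed) * (target - observed)

observed = [0.750, 0.957]   # o_1(E), o_2(E) from the worked example
targets = [0, 1]            # hypothetical targets: the category of O2 is correct
deltas = [output_error(o, t) for o, t in zip(observed, targets)]
print(deltas)   # approximately [-0.141, 0.0018]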