The terms neural networks and deep learning often get thrown around haphazardly. In this article, we give a heuristic description of what a neural network is and how it can be used to solve many of today’s problems.
Neural nets were originally inspired in the 1940s by the human central nervous system: the idea was that the brain consists of many connected nodes (i.e. neurons), with information being relayed back and forth between them on the way from stimulus to reaction.
With the advent of big data to learn from and computers powerful enough to perform the computations, there has been renewed interest in neural nets. They are typically used to solve supervised learning problems, where we have some known inputs and outputs but don’t necessarily know anything about how the two are related.
Examples range from weather forecasting based on pressure readings, butterfly flaps, and tectonic activity, to translating between languages using two sets of the same documents in different languages.
An artificial neural network consists of a set of nodes, connected by adaptive weights.
Each node represents a hypothesis function (also called an activation function), which is a predefined function of the node’s weighted inputs. The hypothesis function can be anything from a simple linear summation of the weighted inputs to much more complicated formulae (though there are some generally desirable attributes; see common Wiki). For example, a linear node would just give as output a linear function of its inputs.
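To make the idea of a node concrete, here is a minimal sketch in plain Python. The function names, the optional `bias` term, and the choice of a sigmoid as the "more complicated" alternative are all assumptions made for illustration; they are not taken from any particular library.

```python
import math


def weighted_sum(inputs, weights, bias=0.0):
    """Linear combination of the inputs: sum of w_i * x_i, plus an optional bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def linear(z):
    """Linear hypothesis function: passes the weighted sum through unchanged."""
    return z


def sigmoid(z):
    """A common non-linear hypothesis function that squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, bias=0.0, activation=linear):
    """Output of a single node: the hypothesis function applied to the weighted inputs."""
    return activation(weighted_sum(inputs, weights, bias))


x = [1.0, 2.0]
w = [0.5, -0.25]
print(node_output(x, w))                      # linear node: 0.5*1.0 - 0.25*2.0 = 0.0
print(node_output(x, w, activation=sigmoid))  # sigmoid(0.0) = 0.5
```

Swapping the `activation` argument is all it takes to change what kind of node this is; the weights are the only part that training will later adjust.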
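After the neural network structure is decided, the weights are randomly initialized. (Note that they should never all be initialized to the exact same value: every node in a layer would then compute the same output and receive the same correction, so this symmetry would be conserved throughout training and the nodes could never learn different things.)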
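The symmetry problem is easy to see in a small sketch. The helper names (`init_weights`, `layer_outputs`), the uniform range, and the fixed seed below are assumptions chosen for illustration; real frameworks use a variety of initialization schemes.

```python
import random


def init_weights(n_inputs, n_nodes, scale=0.1, seed=None):
    """One row of small random weights per node, drawn uniformly from [-scale, scale]."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(n_inputs)]
            for _ in range(n_nodes)]


def layer_outputs(inputs, weights):
    """Weighted sum computed by each (linear) node in a layer."""
    return [sum(w * x for w, x in zip(node_w, inputs)) for node_w in weights]


x = [1.0, 2.0, 3.0]

# Identical weights: both nodes compute exactly the same thing.
same_weights = [[0.5, 0.5, 0.5] for _ in range(2)]
print(layer_outputs(x, same_weights))

# Random weights: the nodes start out different and can specialize.
random_weights = init_weights(n_inputs=3, n_nodes=2, seed=0)
print(layer_outputs(x, random_weights))
```

With identical weights, the two nodes are interchangeable, so any error signal treats them identically as well; random starting values are what lets them diverge.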
Training examples are then fed in one at a time. From the input layer, the data is passed (typically in the form of features) to the hidden layers, transformed at each node by the hypothesis function, weighted, and passed through subsequent layers until it reaches the output layer.
This is known as forward propagation. Once the computation has proceeded all the way to the output layer, we have an initial estimate of the output.
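Forward propagation can be sketched as a loop over layers. The representation below (each layer as a pair of a weight matrix and a hypothesis function) and the example weights are assumptions made for this illustration, and bias terms are left out to keep it short.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(z):
    return z


def forward_layer(inputs, weights, activation):
    """Each node weights its inputs, sums them, and applies its hypothesis function."""
    return [activation(sum(w * x for w, x in zip(node_w, inputs)))
            for node_w in weights]


def forward(inputs, layers):
    """Pass the data through every layer in turn, from input to output.

    `layers` is a list of (weights, activation) pairs ordered from the first
    hidden layer to the output layer.
    """
    values = inputs
    for weights, activation in layers:
        values = forward_layer(values, weights, activation)
    return values


# Illustrative network: 2 inputs -> 2 sigmoid hidden nodes -> 1 linear output node.
network = [
    ([[0.1, -0.2], [0.4, 0.3]], sigmoid),
    ([[0.5, -0.5]], linear),
]
print(forward([1.0, 0.5], network))  # the network's initial estimate
```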
Since there is typically some error (we did initialize the weights randomly, after all), the next step is where the magic happens. The errors, i.e. the differences between the estimated and the actual values, are computed and then passed backwards through the last hidden layer.
The errors are used to slightly adapt the weights of the previous layer using an error minimization algorithm (gradient descent, Newton’s method, or something similar). This step finds a small change to the weights that reduces the observed error; with gradient descent, for example, it is a step in the direction in which the error falls most steeply.
Once the share of the error attributable to this layer has been untangled, the remaining error (the residual) is passed back to the layer before it. This continues until you have reached the input layer again, with all weight corrections computed.
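The backward pass can be sketched for a very small network. The specifics here are assumptions chosen to keep the arithmetic short: one hidden layer of sigmoid nodes, a single linear output node, a squared-error loss, plain gradient descent, and no bias terms. It is one possible instance of the procedure described above, not the only one.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Forward pass: returns the hidden activations and the single output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sum(w * h for w, h in zip(w_out, hidden))
    return hidden, output


def backward(x, target, w_hidden, w_out):
    """Gradients of the loss 0.5 * (output - target)**2 with respect to all weights."""
    hidden, output = forward(x, w_hidden, w_out)
    error = output - target                     # error at the output layer
    grad_out = [error * h for h in hidden]      # corrections for the output weights
    # Error passed back to each hidden node: scaled by the connecting output
    # weight and by the slope of the sigmoid, h * (1 - h).
    delta_hidden = [error * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_hidden, grad_out


def gradient_step(x, target, w_hidden, w_out, learning_rate=0.1):
    """One gradient-descent update: move each weight slightly against its gradient."""
    grad_hidden, grad_out = backward(x, target, w_hidden, w_out)
    new_hidden = [[w - learning_rate * g for w, g in zip(row, grad_row)]
                  for row, grad_row in zip(w_hidden, grad_hidden)]
    new_out = [w - learning_rate * g for w, g in zip(w_out, grad_out)]
    return new_hidden, new_out


x, target = [1.0, 0.5], 1.0
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.5, -0.5]

_, before = forward(x, w_hidden, w_out)
w_hidden, w_out = gradient_step(x, target, w_hidden, w_out)
_, after = forward(x, w_hidden, w_out)
print(abs(before - target), abs(after - target))  # the error shrinks after the update
```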
Now that you have completed an entire pass (one forward and one backward propagation step) for the first training sample, you can proceed in one of two ways.
- Update the weights to their newly found values and run the next training example.
- Run another N-1 training samples without updating the weights (but saving the corrections for each of the N samples).
In the second case, the weights are updated only after the entire batch of N samples has been run through, with the final correction to the weights being the average of the corrections computed for each sample in the batch.
Intuitively, the former is a greedy algorithm, which can be run “online” and continuously, while the latter typically gives better performance, having smoothed out any individual abnormalities in the data.
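The two update strategies differ only in when the corrections are applied. The sketch below uses a single linear node with a squared-error loss so that the correction fits on one line; that simplification, the function names, and the learning rate are assumptions for illustration.

```python
def correction(weights, x, target, learning_rate):
    """Weight correction suggested by one sample (gradient descent on squared error)."""
    prediction = sum(w * xi for w, xi in zip(weights, x))
    error = prediction - target
    return [-learning_rate * error * xi for xi in x]


def online_pass(weights, samples, learning_rate=0.1):
    """First option: update the weights immediately after every sample."""
    for x, target in samples:
        delta = correction(weights, x, target, learning_rate)
        weights = [w + d for w, d in zip(weights, delta)]
    return weights


def batch_pass(weights, samples, learning_rate=0.1):
    """Second option: keep the weights fixed, save every correction, apply their average."""
    deltas = [correction(weights, x, target, learning_rate) for x, target in samples]
    mean_delta = [sum(column) / len(samples) for column in zip(*deltas)]
    return [w + d for w, d in zip(weights, mean_delta)]


samples = [([1.0], 2.0), ([1.0], 4.0)]
print(online_pass([0.0], samples, learning_rate=0.5))  # [2.5]: each sample moves the weight
print(batch_pass([0.0], samples, learning_rate=0.5))   # [1.5]: one averaged move
```

In the online pass the second sample sees weights already changed by the first, whereas in the batch pass every correction is computed against the same starting weights, which is what averages out the influence of any single unusual sample.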
Each training example is potentially passed through the network multiple times (this is normally controlled by the number of iterations, or epochs, in an algorithm’s settings). With each update to the weights, the neural net typically becomes more accurate on the training data.
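Repeated passes over the same data can be sketched with a deliberately tiny model: a single weight fitted to made-up data following y = 2x. The data, the `train` function, and its default settings are assumptions for illustration only.

```python
def train(samples, epochs=50, learning_rate=0.1):
    """Run every sample through the model `epochs` times, updating after each one."""
    # With only one weight there is no symmetry to break, so starting at 0.0 is
    # fine here; a real network would start from random weights.
    weight = 0.0
    history = []
    for _ in range(epochs):
        for x, target in samples:
            error = weight * x - target
            weight -= learning_rate * error * x
        # Mean squared error over the training data at the end of this pass.
        mse = sum((weight * x - target) ** 2 for x, target in samples) / len(samples)
        history.append(mse)
    return weight, history


samples = [(x, 2.0 * x) for x in (0.0, 0.5, 1.0, 1.5)]
weight, history = train(samples)
print(weight)                    # approaches 2.0
print(history[0], history[-1])   # training error after the first and last pass
```

On this clean, noise-free data the error falls with every pass; on real data the improvement is usually less tidy and eventually levels off.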
The basic concept is to define a set of functions (neurons) that when combined by weights give an output that matches the observation in the training data. To match the output, the weights are iteratively corrected to reduce the error of the hypothesis generated by the network.
Notice that the full structure of the neural net has to be predefined, and it is the weights themselves that can be continuously/semi-continuously (in batches) updated. There are extensions to the typical neural network that relax this fixed feed-forward layout. One such extension is the recurrent neural net (RNN), whose connections loop back on themselves; we shall discuss it in a subsequent entry.
Further, recurrent neural nets have some notion of state, being able to “remember” past sequences. They are also (at least in theory) Turing complete; this means that, given enough time and memory, they can carry out any computation that a computer can.
Amazingly, neural nets are also universal approximators: with well-chosen activation functions for their neurons, and enough of them, they can approximate any continuous function to arbitrary accuracy. This basically means that if there is some relationship (continuous mapping) between input and output, a sufficiently large neural net is able to represent it.
Unfortunately, this makes no guarantee about how fast or simple the convergence to the true underlying function will be, nor about how accurately the resulting model will describe your data.
Indeed, even though the range of applications and the accuracy of neural networks are impressive, they may not be the ideal algorithm for your data problem. Neural nets, unfortunately, are difficult to interpret and, in general, require a longer and more complicated tuning procedure.
It will always be necessary to determine the algorithm that best suits your data. There is no free lunch!