Artificial Neural Networks, known affectionately as 'networks', constitute a class of signal processing algorithms1 that bear some, however remote, resemblance to 'wetware' neural networks, such as the nervous systems of animals (like the human brain). Still, this is not really artificial intelligence, at least not on its own, and this is not a good mathematical model of actual physicochemical brains.
Viewpoints on Artificial Neural Networks
Several scientific communities contribute to the theory of artificial neural networks, and most of these have their own viewpoints on them. These communities include electrical engineering, signal processing, mathematical statistics, computational science, complexity theory, artificial intelligence and even some corners of quantitative neurobiology.
This has the effect that if you speak with an artificial intelligence researcher, s/he might tell you that networks can be seen as one building block that can be used in forging novel intelligence, whereas the guys down at electrical engineering will argue that this is just a case of non-parametric function-approximation with adaptive basis-functions. A frequentist statistician might mumble about Bayesian non-stringency, and a quantitative neurobiologist could speak for hours and hours on how unfathomably complex the real wetware networks actually are. Take your pick; most of these people can tell really cool stories!
Artificial neural networks have proven to be practical, robust tools that are used in many applications: distinguishing bombs and weapons from alarm clocks in semi-automatic airport x-ray machines; translating spoken words into computer commands; and controlling autonomous robots, to mention a few. Some of the network theory also helps by defining a conceptual vocabulary that enables scientists to describe more accurately the vastly more complex phenomena that we observe in systems like our own brains.
As usual, there are problems as well. Even if you have a nice network that does its job, it is almost impossible to tell just how it does it. This goes along the same lines as asking a natural talent how s/he does whatever s/he is good at. They just do it. Artificial neural networks also typically involve the use of non-linear optimisation (explained later), and are then largely dependent on the performance of this rather difficult procedure.
Architecture and Training
A neural network is composed of individual, locally-connected units termed neurons. Typically each neuron weighs the signals arriving on its input connections according to its own fashion, sums them up and transforms this weighted sum with a non-linear function. This function is often termed the activation function, in analogy with the biological neuron.
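The description of a single neuron above can be sketched in a few lines of Python. This is only an illustration: the choice of tanh as the non-linearity, the extra bias term and all the numbers are assumptions made for the example, not something the text prescribes.

```python
import math

def neuron(inputs, weights, bias):
    """Weigh each input, sum the results, then squash the sum non-linearly."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # tanh is one common activation function; it is an assumption here.
    return math.tanh(weighted_sum)

# Hypothetical inputs and weights, chosen only for illustration.
output = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.1, 0.4, -0.2], bias=0.3)
print(output)
```

Whatever the inputs, the tanh keeps the output between -1 and 1, which is the kind of saturating behaviour the analogy with a biological neuron has in mind.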
Connecting several layers in succession is a great idea. One can show that a network with only two layers of adaptive weights suffices to approximate any continuous function as closely as you like, given 'enough' neurons. This is just theory, however, and one should note that it only guarantees the existence of such a network solution - not the ability to actually find it! That requires some kind of learning procedure.
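To get a feeling for what the second layer buys, consider the exclusive-or (XOR) of two binary inputs, which no single threshold neuron can compute. The sketch below shows a network with two layers of weights that manages it. The weights are picked by hand rather than learned, and the hard threshold activation is an assumption made to keep the arithmetic easy to follow.

```python
def step(z):
    """Hard threshold activation (an assumption; real networks often use smooth ones)."""
    return 1.0 if z > 0 else 0.0

def layer(inputs, weights, biases):
    """One layer: every neuron takes a weighted sum of all inputs, then thresholds it."""
    return [step(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hand-picked, hypothetical weights: the first hidden neuron acts as OR, the second as AND.
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [-0.5, -1.5]
# The output neuron fires when OR is on but AND is off, which is XOR.
OUT_W = [[1.0, -1.0]]
OUT_B = [-0.5]

def xor_net(x1, x2):
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUT_W, OUT_B)[0]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

Nothing here was found by a learning procedure, which is exactly the gap the paragraph above points out: knowing that suitable weights exist is not the same as finding them.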
Changing the network parameters is termed network training. This procedure is also referred to as weight adaptation, or in short 'learning', since that is really what's going on. One can think of learning as attempting to store data in a way that allows generalisation.
Training of a network can be done by most types of standard, non-linear optimisation algorithms such as gradient descent or BFGS2. To understand this, picture the network parameters as latitude and longitude in a large, and often insanely multi-dimensional, rolling landscape, where the altitude represents how far from the desired answer the output is. The optimisation algorithm then strolls along on the surface trying to find as low a valley as possible.
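The landscape picture can be made concrete with a toy example of plain gradient descent. The two-parameter bowl-shaped error function below is made up for the purpose (a real network's landscape has far more dimensions and many valleys), and the learning rate and step count are arbitrary assumptions.

```python
def error(w1, w2):
    """A hypothetical error landscape with its lowest point at (3, -1)."""
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2

def gradient(w1, w2):
    """Slope of the landscape in each parameter direction."""
    return 2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)

def gradient_descent(w1, w2, learning_rate=0.1, steps=200):
    """Repeatedly take a small step downhill from the current position."""
    for _ in range(steps):
        g1, g2 = gradient(w1, w2)
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
    return w1, w2

w1, w2 = gradient_descent(0.0, 0.0)
print(w1, w2, error(w1, w2))
```

On a simple bowl like this one the stroll always ends at the bottom. On a real network's landscape the same procedure may settle in a valley that is merely low, not the lowest.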
A very neat feature developed in this context is 'error-backpropagation'. This solves the problem of assigning the blame for a bad prediction to individual neurons (aka the credit assignment problem). Neurons are very local creatures, remember, but using differentiable non-linearities means that we can use the chain-rule 3 to determine who did what to the final result. In the landscape analogy this corresponds to computing how steep the terrain around the current location is, and in which direction it slopes.
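As a minimal sketch of the chain rule at work, take the smallest network imaginable: one input, one hidden tanh neuron with weight w, and one linear output neuron with weight v. The network, the squared-error measure and all the numbers are assumptions chosen for illustration. The blame assigned by the chain rule is checked against a brute-force estimate obtained by nudging each weight slightly.

```python
import math

def forward(w, v, x):
    """Hidden activation h = tanh(w * x), network output y = v * h."""
    h = math.tanh(w * x)
    return h, v * h

def error(w, v, x, target):
    _, y = forward(w, v, x)
    return 0.5 * (y - target) ** 2

def backprop(w, v, x, target):
    """Apply the chain rule from the output back towards the input."""
    h, y = forward(w, v, x)
    dE_dy = y - target                  # blame on the output
    dE_dv = dE_dy * h                   # share of the output weight
    dE_dh = dE_dy * v                   # blame passed back to the hidden neuron
    dE_dw = dE_dh * (1.0 - h * h) * x   # through the tanh (derivative 1 - h^2) to w
    return dE_dw, dE_dv

def numeric_gradient(w, v, x, target, eps=1e-6):
    """Brute-force slopes from central differences, used only as a check."""
    dE_dw = (error(w + eps, v, x, target) - error(w - eps, v, x, target)) / (2 * eps)
    dE_dv = (error(w, v + eps, x, target) - error(w, v - eps, x, target)) / (2 * eps)
    return dE_dw, dE_dv

# Hypothetical weights, input and target.
analytic = backprop(w=0.7, v=-1.3, x=0.5, target=1.0)
numeric = numeric_gradient(w=0.7, v=-1.3, x=0.5, target=1.0)
print(analytic, numeric)
```

The two results should agree to many decimal places. The point of backpropagation is that the chain-rule version gets all the slopes in one backward sweep, whereas the brute-force version needs extra forward passes for every single weight.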
Roughly, one can divide the learning procedures of most learning systems, including artificial neural networks, into supervised or unsupervised learning. In the case of supervised learning, a set of training samples is used, each pairing an input with the desired answer. When no such 'book of answers' is present, training is unsupervised.
For instance, attempting to predict tomorrow's Dow Jones index using as input variations in interest rates, certain key numbers of the largest companies, and the amount of perspiration present on Mr Greenspan's hands, would be a case of supervised learning. We have both input data and target data.
If we instead present a large number of spoken Finnish words to a network, and modify the network slightly according to some local, competitive rules with each new word, we end up with neurons that each recognise one Finnish phoneme (a distinct unit of sound which separates words). This is a case of unsupervised learning; no-one told the network what phonemes there are - it found them itself. The unsupervised case includes the many interesting self-organisational techniques. Self-organisation, although mathematically rather simple, is robust and seems to be frequently employed in carving the layout of several systems in at least mammalian wetware networks, including hearing, vision and language processing. We may have discovered these principles, but they probably played a key role in turning us into what we are, opposing the forces of increasing entropy by creating global order out of local interactions4.
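A deliberately tiny sketch of a local, competitive rule is shown below. It is not the phoneme experiment described above: the 'sounds' are just made-up numbers clustered around two values, and the winner-takes-all rule, learning rate and starting points are all assumptions chosen for illustration. Each neuron holds a prototype, and only the neuron closest to the current sample is allowed to move towards it.

```python
def competitive_learning(samples, prototypes, learning_rate=0.1, passes=50):
    """Winner-takes-all: only the closest prototype moves towards each sample."""
    prototypes = list(prototypes)
    for _ in range(passes):
        for x in samples:
            winner = min(range(len(prototypes)),
                         key=lambda k: abs(prototypes[k] - x))
            prototypes[winner] += learning_rate * (x - prototypes[winner])
    return prototypes

# Hypothetical data: unlabelled samples scattered around 1 and around 9.
samples = [0.8, 9.2, 1.0, 8.8, 1.2, 9.0]
prototypes = competitive_learning(samples, prototypes=[4.0, 6.0])
print(prototypes)
```

No sample carries a label saying which group it belongs to, yet one neuron ends up sitting near 1 and the other near 9. The order emerges from nothing but the local competition.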
The most powerful property of an artificial neural network is the ability to generalise: not only to reproduce previously seen data, but also to provide correct predictions in similar situations.
For instance, feeding a network with a few words, pronounced by a few speakers, may allow it to recognise these words when spoken by a previously unheard speaker.
In the early days (the 1960s), guys like Rosenblatt and Widrow built fascinating, linear and mostly single-layer networks using lots and lots of transistors. Development came to an embarrassing halt shortly afterwards, once it was proved that these types of networks were fun, but rather useless. It was not until the 1980s that they became popular again, due to the paper Learning Internal Representations by Error Propagation, which Rumelhart, Hinton and Williams published in the very influential book Parallel Distributed Processing: Explorations in the Microstructure of Cognition5.
This paper brought attention to the things that made artificial neural networks surpass their believed limitations: differentiable non-linearities6 and multi-layer networks. These ideas had really been discovered already in the 1970s, for instance by Werbos (1974). Nobody really remembers that, though, since he did it in his dissertation, and no-one really reads dissertations.
Although the feed-forward network architectures are the most commonly employed in engineering applications, allowing some down-stream neurons to connect to up-stream ones adds an interesting feature - dynamics. Such recurrent networks exhibit the characteristics of complex, adaptive systems. This can be used for various useful things, but is mainly invoked to de-blur noisy images or recognise the handwriting of engineering students7. This might also be a possible link to artificial intelligence, since complex dynamical systems are more or less prerequisites for sentience. But then a waterfall is also a complex dynamical system, so don't expect too much.
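One classic way to see recurrent dynamics clean up a noisy pattern is a Hopfield-style network, in which every neuron feeds back to every other. The text above does not name a particular architecture, so this choice, the two eight-pixel 'images' and the simple update schedule are all assumptions made for the sketch.

```python
def train(patterns):
    """Hebbian storage: strengthen connections between pixels that agree."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no neuron connects to itself
                    weights[i][j] += p[i] * p[j]
    return weights

def recall(weights, state, max_steps=10):
    """Let the network run until the state stops changing (or max_steps is hit)."""
    state = list(state)
    for _ in range(max_steps):
        new_state = []
        for i, row in enumerate(weights):
            field = sum(w * s for w, s in zip(row, state))
            # A neuron with zero net input keeps its current value.
            new_state.append(state[i] if field == 0 else (1 if field > 0 else -1))
        if new_state == state:
            break
        state = new_state
    return state

# Two hypothetical eight-pixel 'images' with pixel values +1 or -1.
stored_a = [1, 1, 1, 1, -1, -1, -1, -1]
stored_b = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train([stored_a, stored_b])

noisy = list(stored_a)
noisy[0] = -noisy[0]  # corrupt one pixel
restored = recall(weights, noisy)
print(restored == stored_a)
```

The feedback connections are what make this work: the state is fed through the same weights again and again until it settles, which a strictly feed-forward network never does.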
Bayesian statistics can be used to explain and improve much of artificial neural network theory and practice. Notably, a standard feed-forward network can, under certain circumstances, be shown to approximate a posteriori class probabilities (how likely is it that this person belongs to that class, given all that we know about him?). Bayesian networks allow keeping track of errors in data, and estimating the likelihood of error in network estimates.
If you read only one book on neural networks, the recommended one is Neural Networks for Pattern Recognition by C M Bishop (1995), Oxford University Press.