9 of 9 people found the following review helpful:
5.0 out of 5 stars
Excellent, January 27, 2010
This review is from: Algebraic Geometry and Statistical Learning Theory (Cambridge Monographs on Applied and Computational Mathematics) (Hardcover)
Statistical learning theory is now a well-established subject, and has found practical use in artificial intelligence as well as serving as a framework for studying computational learning theory. There are many fine books on the subject, but this one studies it from the standpoint of algebraic geometry, a field which decades ago was deemed too esoteric for use in the real world but is now embedded in a myriad of applications. More specifically, the author uses the resolution of singularities theorem from real algebraic geometry to study statistical learning theory when the parameter space is highly singular. The clarity of the book is outstanding and it should be of great interest to anyone who not only wants to study statistical learning theory but is also interested in yet another application of algebraic geometry. Readers will need preparation in real and functional analysis, and some good background in algebraic geometry, but not necessarily at the level of modern approaches to the subject. In fact, the author does not use algebraic geometry over algebraically closed fields (only over the field of real numbers), and so readers do not need to approach this book with the heavy machinery that is characteristic of most contemporary texts and monographs on algebraic geometry. The author devotes some space in the book to a review of the needed algebraic geometry.
Also reviewed in the initial sections of the book are the concepts from statistical learning theory, including the very important method of comparing two probability density functions: the Kullback-Leibler distance (called relative entropy in the physics literature). The reader will need a good understanding of functional analysis to follow the discussion, and should be able to appreciate, for example, the difference between convergence in different norms on function space. From a theoretical standpoint, learning can be different in different norms, a fact that becomes readily apparent throughout the book (from a practical standpoint, however, it is difficult to distinguish between norms, due to the finiteness of all data sets). Of particular importance in the early discussion is the need for "singular" statistical learning theory, which, as the author shows, boils down to finding a mathematical formalism that can cope with learning problems where the Fisher information matrix is not positive definite (in this case there is no guarantee that unbiased estimators will be available). This is where (real) algebraic geometry comes in, for it allows the removal of the singularities in parameter space by recursively using "blow-up" (birational) maps. The author lists several examples of singular theories, such as hidden Markov models, Boltzmann machines, and Bayesian networks. The author also shows how to generalize some of the standard constructions in "ordinary" or "regular" statistical learning to the case of singular theories, such as the Akaike information criterion and Bayes information criterion. Some of the definitions he makes are somewhat different from what some readers are used to, such as the notion of stochastic complexity. In this book it is defined merely as the negative logarithm of the "evidence", whereas in information theory it is a measure of the code length of a sequence of data relative to a family of models. The methods for calculating the stochastic complexity in both cases are similar, of course.
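For readers who want to see these two quantities in the simplest possible setting, here is a small sketch of my own (it is not taken from the book). It assumes a Bernoulli model with a uniform prior on the parameter, which is a regular rather than a singular model; the function names are hypothetical. It computes the Kullback-Leibler distance between two Bernoulli distributions and the stochastic complexity in the book's sense, i.e. the negative logarithm of the evidence.

```python
import math


def kl_bernoulli(q, p):
    """Kullback-Leibler distance K(q || p) between Bernoulli(q) and Bernoulli(p).

    Assumes 0 < p < 1 so that the logarithms are finite.
    """
    total = 0.0
    for q_i, p_i in ((q, p), (1.0 - q, 1.0 - p)):
        if q_i > 0.0:  # terms with zero probability contribute nothing
            total += q_i * math.log(q_i / p_i)
    return total


def stochastic_complexity(k, n):
    """Negative log evidence of one specific sequence with k successes in n trials.

    Assumption for illustration: uniform prior on the success probability w, so
    evidence = integral over [0, 1] of w^k (1 - w)^(n - k) dw = Beta(k + 1, n - k + 1).
    """
    log_evidence = (
        math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
    )
    return -log_evidence


if __name__ == "__main__":
    # The "distance" is not symmetric in its two arguments.
    print(kl_bernoulli(0.5, 0.25), kl_bernoulli(0.25, 0.5))
    # Stochastic complexity grows as more data are observed.
    for n in (10, 100, 1000):
        print(n, stochastic_complexity(n // 2, n))
```

In a singular model the evidence integral generally has no closed form like the Beta function used here, which is part of why the book needs heavier machinery.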
In singular theories, one must deal with such things as the divergence of the maximum likelihood estimator and the failure of asymptotic normality. The author shows how to deal with these situations after the singularities are resolved, and he gives a convincing argument as to why his strategies are generic enough to cover situations where the set of singular parameters, i.e. the set where the Fisher information matrix is degenerate, has measure zero. In this case, he correctly points out that one still needs to know if the true parameter is contained in the singular set, and this entails dealing with "non-generic" situations using hypothesis testing, etc.
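To make the idea of a degenerate Fisher information matrix concrete, here is a minimal sketch, again my own and not the author's code. It assumes a toy regression model y = a*tanh(b*x) plus Gaussian noise of unit variance, with inputs on a fixed grid chosen purely for illustration. For such a model the Fisher information matrix is the average over inputs of the outer product of the gradient of the regression function with respect to (a, b).

```python
import math


def fisher_information(a, b, xs):
    """2x2 Fisher information matrix of y = a*tanh(b*x) + N(0, 1) noise.

    The expectation over inputs is approximated by an average over the grid xs.
    """
    i_aa = i_ab = i_bb = 0.0
    for x in xs:
        t = math.tanh(b * x)
        df_da = t                       # derivative of a*tanh(b*x) w.r.t. a
        df_db = a * x * (1.0 - t * t)   # derivative of a*tanh(b*x) w.r.t. b
        i_aa += df_da * df_da
        i_ab += df_da * df_db
        i_bb += df_db * df_db
    n = len(xs)
    return [[i_aa / n, i_ab / n], [i_ab / n, i_bb / n]]


def determinant(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Illustrative input grid on [-2, 2]; the range and spacing are assumptions.
xs = [-2.0 + 0.02 * i for i in range(201)]

if __name__ == "__main__":
    print("regular point (a=1, b=1):", determinant(fisher_information(1.0, 1.0, xs)))
    print("singular point (a=0, b=1):", determinant(fisher_information(0.0, 1.0, xs)))
    print("singular point (a=1, b=0):", determinant(fisher_information(1.0, 0.0, xs)))
```

The singular set here is the pair of lines a = 0 and b = 0 in the (a, b) plane. It has measure zero, yet every point on it gives the same zero function, which is the kind of situation the paragraph above describes.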
Examples of singular learning machines are given towards the end of the book, one of these being a hidden Markov model, while another deals with a multilayer perceptron. The latter example is very important since the slowness in learning in multilayer perceptrons is widely encountered in practice (largely dependent on the training samples). The author shows how this is related to the singularities in the parameter space from which the learning is sampled, even when the true distribution is outside of the parametric model, where the collection of parameters is finite. This example lends credence to the motto that "singularities affect learning" and the author goes on further to show to what extent this is a "universal" phenomenon. By this he means that having only a "small" number of training samples will bring out the complexity of the singular parameter space; increasing the number of training samples brings out the simplicity of the singular parameter space. He concludes from this that the singularities make the learning curve smaller than that of any nonsingular learning machine. Most interestingly, he speculates that "brain-like systems utilize the effect of singularities in the real world."