14 of 14 people found the following review helpful:
4.0 out of 5 stars
A very interesting approach, November 3, 2005
This review is from: Algebraic Statistics for Computational Biology (Hardcover)
This book is unique, in that is fuses together two subjects that may at first be thought of as having zero intersection. Indeed, computational biology and algebraic geometry are very different in their concepts, terminology, and goals, but thanks to the efforts of the editors (and their students), techniques from algebraic geometry are now being applied to problems in genomics, particularly sequence alignment, or at least to rephrase these problems in the language of algebraic geometry. The economy of thought obtained via the use of algebraic geometry though should be measured against the time required to master its subject matter. It might be rare to find a mathematician who is a specialist in algebraic geometry to also be interested in bioinformatics or biological sequence analysis, even though the editors of this book clearly are. Even more difficult would be to find a computational biologist or specialist in bioinformatics who is willing to learn algebraic geometry, which is a formidable subject, even if attention is restricted to a subfield called "computational" algebraic geometry. Those readers familiar with computational algebraic geometry will see that `algebraic statistics', the name that has been given to the subject matter of the book, can be thought of as the algebraic geometry of toric varieties. The models of algebraic statistics include those that are very familiar to researchers in genomic sequence analysis, namely hidden Markov models and graphical models. This reviewer only read the first three chapters of this book, and so the following commentary will be restricted to these.
The authors give an extremely clear description of how statistical considerations arise in genetic sequence analysis in chapter one. To motivate the subject by considering a fictional character called "DiaNA" who is producing sequence data by the throwing of tetrahedral dice, with each face labeled with the letters A, C, G, and T. The dice are not considered to be fair. The problem is to find the likelihood of observing a particular sequence of data. The probabilities of the individual letters are assumed to be functions of parameters ("theta" parameters) and the (log) likelihood function of these parameters must then be maximized. This is straightforwardly by taking partial derivatives. The authors mention `Grobner bases' as a tool for simplifying the resulting equations.
For the authors a general statistical model (for discrete data) is a family of probability distributions on a finite `state space', usually taken to be a string of positive integers of length M. A probability distribution on this state space is thought of as a point in a `probability simplex', the latter being a collection of vectors in M-dimensional Euclidean space (RM) whose entries sum to 1. An `algebraic statistical model' arises as the image of a polynomial map F from D-dimensional Euclidean space (RD) to RM. The model parameters are in RD and D is usually taken to be considerably less than M. A `linear model' is defined as one where each of the M coordinate polynomials is a linear function. In a `toric' or `log-linear' model the logarithms of the probabilities are linear functions in the logarithms of the parameters. The authors show in detail how to formulate maximum likelihood estimation for both of these models and prove that they both have a unique local maximum. Noted in this discussion is the concept of a (relatively open) polytope, which is very familiar in the theory of toric varieties where it is called the `moment map.'
The expectation maximization (EM) technique, used for models that are not linear or toric is also discussed for a class of algebraic statistical models that computational biologists will easily recognize as hidden Markov models. The convergence properties of the EM algorithm are discussed in the language of algebraic geometry in chapter 3 of the book. The closure of the set of critical points of the likelihood function is an algebraic variety in D-dimensional complex space and is referred to as the `likelihood variety'. The likelihood variety is computed by constructing a polynomial ring in M unknowns and D parameters, called the "big ring" by the authors, and s subring in D parameters called the "small ring." An ideal J of the big ring generated by a particular collection of M + D polynomials is introduced, so that the points of the variety V(J) are related to the critical points of the log-likelihood function. The `elimination ideal' I in the small ring is referred to as the likelihood ideal of the statistical model with respect to the data. The optimization problem is solved by computing V(I), intersecting it with the pre-image of the probability simplex, and then identifying the local maxima in this intersection. These constructions motivate the notion of the `maximum likelihood degree' (ML degree) of an algebraic statistical model, which is the number of complex critical points of the log-likelihood function. The ML degree, as defined by the authors, measures the algebraic complexity of the process of maximum likelihood estimation for an algebraic statistical model. In most cases of interest the ML degree will be a positive integer and will be equal to the degree of the (irreducible) polynomial in the coordinates of the maximum likelihood estimate.
Also embedded in the discussions of these chapters is the notion of `tropical' arithmetic and geometry. Tropical arithmetic arises in the discussion of computation and dynamical programming. The `tropical semiring' consists of the extended real numbers along with two operations, called `addition' and `multiplication'. Addition of two real numbers consists of taking the smaller of the two, and `multiplication' consists of adding them. The utility of tropical arithmetic lies in its ability to find the shortest path in a weighted directed graph. The authors show how to make algebraic varieties, and thus statistical models, `tropical.' The tropicalization G of a statistical model is linear on each cone in the normal fan of the Newton polytope of the statistical model. The evaluation of G corresponds to solving the dynamic programs for all possible observations.
Help other customers find the most helpful reviews
Was this review helpful to you? Yes
No
2 of 3 people found the following review helpful:
4.0 out of 5 stars
Algebraic Statistics for Computational Biology, September 21, 2007
This review is from: Algebraic Statistics for Computational Biology (Hardcover)
This text has provided technical examples and learning material for a complicated mathematical/biological/computer science topic. The text is not for introductory purposes, but when used for graduate level course of algebraic statistics, it has proven useful in a variety of manners. The text offers a great overview of the field of algebraic statistics, along with several chapters showing applied examples of the field. This text is recommended for graduate to post graduate level students interested in the field of algebraic statistics.
Help other customers find the most helpful reviews
Was this review helpful to you? Yes
No
5.0 out of 5 stars
Great introduction plus applications!, October 24, 2010
This review is from: Algebraic Statistics for Computational Biology (Hardcover)
There are few sources on this topic, and thus for my research in algebraic statistics for phylogenetic trees, I have come to refer to this book as "the bible." I read the sections applicable to my research several times and now refer back to them repeatedly. It is written very clearly and concisely.
What keeps from actually being comprehensive is that the general discussion is targeted as preparation for the problems they are solving in the later chapters.
Help other customers find the most helpful reviews
Was this review helpful to you? Yes
No