System Administrators are not known for consensus and conformity. It doesn't take long for new admins to fall in love with a tool or a programming language (or to fall into hate). The Editor Wars are probably the best-known ongoing dividing line, but fault lines can appear around any choice we make.
This is what makes the books by Limoncelli, Chalup and Hogan (LC&H) so remarkable. If you ask most sysadmins what single book they should read, the answer will almost certainly be "The Practice of System and Network Administration". They're going to have a harder time now, with the release of Volume 2: "The Practice of Cloud System Administration". (Just so you know, it's already known by the abbreviation TPOCSA.) I think this is likely to become a must-read.
One of the tenets of TPOCSA (and of all quality design) is "Keep it Simple". The authors present cloud administration in two parts. Pretty simple, eh? First they define the characteristics of their ideal system, then they go on to describe the methods that they use to try to achieve that ideal.
When I say "describes the system" I mean that in a somewhat abstract way. LC&H aren't talking about which database is best or how much memory you need to render a movie frame. It turns out that all large-scale distributed systems have a set of common characteristics. These, along with the requirements for high reliability and robustness, have led to a set of best practices that have become generally accepted, largely because they have been shown to work. The hitch is that most of them seem counter-intuitive and nearly all directly contradict standard practices of two decades ago.
In this section the authors also make clear the scope of what "System Administration" means. Up until the advent of virtualization and ubiquitous high-speed networks, it meant OS installation and some network configuration. When the machine was ready, it would be handed off to an application and operations team for the rest of the host's lifetime. The SA tasks would probably include backups and periodic patching (or at least that's what many people thought). Today System Administration and Operations are largely synonymous. This union even has a word: DevOps (which *is* contentious, so I won't discuss it further here).
So we're talking about a large-scale distributed system. Whenever you have something big and made up of lots of parts, you inevitably have failures. Much of the rest of the book consists of ways to make that not matter, taking human nature and the "physics" of highly complex systems into account to make robust, seamless services which run well even as they are changing.
Scanning over the chapter headings after the section break, I am struck by something which should have been obvious. This is a book about Practices. The first section is really a glossary, a base of terminology and concepts on which to build. But what we build from them, the system which results, isn't just our cloud application. The infrastructure that LC&H are talking about here is as much a social one as it is technical. Each of the computational components is meant either to facilitate human communication or to remove painful, time-consuming or error-prone tasks.
System Administrators are no longer just bricklayers and janitors. They are involved in every phase of the application life-cycle, from inception through a long, continuously evolving life span. LC&H discuss the philosophy and practice of each phase, always considering that humans are expensive (and error-prone) while computation is cheap. Automation, documentation and monitoring are all reconsidered with an eye to minimizing drudgery and false rigor and replacing them with a mind-set that will evaluate what's really important: comprehension and communication.
I read Gene Kim's "The Phoenix Project" not long after it came out. While I smiled and nodded knowingly all the way through, it felt a little like a unicorn story. I thought, "This is nice, but no one in business is going to take a novel seriously as a model for business practice." Of course I was wrong, but I still think that something more is needed: not just a parable but a manual. The line where the authors cite Gene for "inspiration and encouragement" indicates that LC&H thought so too.
There really wasn't much in this book that was new to me. I think much of what's here is already fairly common knowledge. What TPOCSA has done is to bring together in one place the accumulated body of knowledge which has been growing and changing since the birth of the Internet. Today's computer systems are a far cry from the mainframes, minicomputers and PCs which dominated the 1990s. There have been a number of movements triggered by the changes since then: Agile development, the DevOps movement, Continuous Integration and Deployment. TPOCSA brings them all together and reminds us that the methodology, the philosophy, the ideology are not what matters. The System, running and serving reliably, is what matters. All of the rest are just means to that end.
So who should read it? I think anyone claiming to be a System Administrator today should be conversant with what's here, but I think the bigger impact will come when we pass it to a colleague, whether a developer or a manager. There's a lot of confusion around what Cloud Computing means, and TPOCSA gives us a common base on which to build our systems and our processes.