Introduction
This is a book about untrustworthy machines; machines, in fact, that are every bit as untrustworthy as they are critical to our well-being. But then I don't need to bore you with laundry lists of how prevalent computer systems have become, or horror stories about what can happen when they fail. If you picked up this book, then I'm sure you're well aware of the problems: layer upon layer of interdependent libraries hiding bugs in their abstractions, script kiddies, viruses, DDoS attacks, hardware failure, end-user error, backhoes, hurricanes, and on and on. Whether the root cause is malicious or accidental, your systems will fail, and when they do, only two things will save you from the downtime: redundancy and monitoring systems.
Do it right the first time
In concept, monitoring systems are simple: an extra system, or collection of systems, whose job it is to watch the other systems for problems. For example, the monitoring system could periodically connect to a web server to make sure it responds and, if it does not, send notifications to the administrators. And while it all sounds quite straightforward, monitoring systems have grown into expensive, complex pieces of software. Many now have agents larger than 500 MB, include proprietary scripting languages, and sport price tags above $60,000.
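The web server example above can be sketched in a few lines of Python. This is only an illustration of the concept, not how any particular product does it; the function names, the five-second timeout, and the `notify` callback are assumptions made for the sketch.

```python
import urllib.error
import urllib.request


def web_server_responds(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the server answers with a 2xx or 3xx status."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError (4xx/5xx) and socket timeouts all derive from OSError.
        return False


def check_and_notify(url, notify, check=web_server_responds):
    """Run one check and call notify(message) only if the check fails."""
    if check(url):
        return True
    notify(f"ALERT: {url} did not respond")
    return False
```

Something would still have to call `check_and_notify` on a schedule (cron, or a loop with a sleep) and supply a `notify` function that sends mail or pages someone. Deciding how often to check, whom to tell, and how many times to tell them is where the real work of a monitoring system begins.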
When implemented correctly, a monitoring system can be your best friend. It can notify admins of glitches before they become crises, help architects tease out patterns corresponding to chronic interoperability issues, and give engineers detailed capacity planning info. A good monitoring system will help the security guys correlate interesting events, show the network operations center personnel where the bandwidth bottlenecks are, and provide management with much-needed, high-level visibility into the critical systems they bet their business on. A good monitoring system can help you uphold your service level agreement (SLA), and even take steps to solve problems without waking anyone up at all. Good monitoring systems save money, bring stability to complex environments, and make everyone happy.
When done poorly, however, the very same system can wreak havoc. Bad monitoring systems cry wolf at all hours of the night so often that nobody pays attention anymore. They install backdoors into your otherwise secure infrastructure, leech time and resources away from other projects, and congest network links with megabyte upon megabyte of health checks. Bad monitoring systems can really suck.
Unfortunately, getting it right the first time isn't as easy as you might think, and in my experience, a bad monitoring system doesn't usually survive long enough to get fixed. Bad monitoring systems are just too much of a burden on everyone involved, including the systems being monitored. In this context, it's easy to see why large corporations and governments employ full-time monitoring specialists and purchase software with six-figure price tags. They know how important it is to get it right the first time.
Small- to medium-sized businesses and universities can have environments as complex as, or even more complex than, those of large companies, but they obviously don't have the luxury of high-priced tools and specialized expertise. Getting a well-built monitoring infrastructure in these environments, with their geographically dispersed campuses and satellite offices, can be a challenge. But having spent the better part of the last seven years building and maintaining monitoring systems, I'm here to tell you that not only is it possible to get it done right the first time, but you can do it for free, with a bit of elbow grease, some open source tools, and a pinch of imagination.
Why Nagios?
Nagios is, in my opinion, the best system and network monitoring tool available, open source or otherwise. Its modularity and straightforward approach to monitoring make it easy to work with and highly scalable. Further, Nagios' open source license makes it freely available and easy to extend to meet your specific needs. Instead of trying to do everything for you, Nagios excels at interoperability with other open source tools, which makes it very flexible. If you're looking for a monolithic piece of software with checkboxes that solve all your problems, this probably isn't the book for you, but before you stop reading, give me another paragraph or two to convince you that the checkboxes aren't really what you're looking for.
The commercial offerings get it wrong mainly because their approach to the problem assumes that everyone wants the same solution. To a certain extent, this is true. Everyone has a large glob of computers and network equipment, and wants to be notified if some subset of it fails. So if you want to sell monitoring software, the obvious way to go about it is to create a piece of software that knows how to monitor every conceivable piece of computer software and networking gear in existence. The more gadgets your system can monitor, the more people you can sell it to. To someone who wants to sell monitoring software, it's easy to believe that monitoring systems are turnkey solutions, and whoever's software can monitor the largest number of gadgets wins.
The commercial packages I've worked with all seem to follow this logic. Not unlike the Borg, they methodically locate new computer gizmos and add the requisite monitoring code to their solution or, worse, acquire other companies that already know how to monitor lots of computer gadgetry and bolt that company's code onto their own. They quickly become obsessed with features, creating enormous spreadsheets of supported gizmos. Their software engineers exist so that the pre-sales engineers can come to your office and say to your managers, through what seem to be layers of gleaming white teeth, "Yes, our software can monitor that."
The problem is, monitoring systems are not turnkey solutions. They require a large amount of customization before they really start solving problems, and herein lies the difference between people selling monitoring software and those designing and implementing monitoring systems. When you're trying to build a monitoring system, a piece of software that can monitor every gadget in the world by clicking a checkbox is not as useful to you as one that makes it easy to monitor what you need, in exactly the manner that you want. By focusing on what to monitor, the proprietary solutions neglect the 'how', which limits the context in which they may be used.
Take 'ping' for example. Every monitoring system I've ever dealt with uses ICMP Echo requests, otherwise known as 'pings', to check host availability in one way or another. But if you want to control how a proprietary monitoring system uses ping, architectural limitations quickly become apparent. Let's say I want to specify the number of ICMP packets to send, or want to send notifications based on the round trip time of the packet in microseconds instead of a simple pass/fail. More complex environments may necessitate that I use IPv6 pings, or that I portknock[1] before I ping. The problem with the monolithic, feature-full approach is that these changes represent changes to the core application logic, and are therefore non-trivial to implement.
In the commercial monitoring applications I've worked with, if these ping examples could be performed at all, they would require re-implementing the ping logic in the monitoring system's proprietary scripting language. In other words, you would have to toss out the built-in ping functionality altogether. Perhaps being able to control the specifics of ping checks is of questionable value to you, but if you don't really have any control over something as basic as ping, what are the odds that you'll have fine-grained enough control over the most important checks in your environment? They've made the assumption that they know how you want to ping things, and from then on it was game over; they never thought about it again. And why would they? The ping feature is already in the spreadsheet, after all.
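To make the ping variations above concrete, here is a rough sketch of a check that is driven by the number of packets sent and by round trip times in microseconds rather than a bare pass/fail. Everything in it is an assumption for illustration: the function name, the threshold parameters, and the 50% loss limit. Actually sending the ICMP packets is left to whatever tool collects the timings.

```python
from statistics import mean


def classify_ping(rtts_us, packets_sent, warn_us, crit_us, max_loss=0.5):
    """Classify a ping result from round trip times given in microseconds.

    rtts_us holds one entry per reply received; lost packets have no entry.
    Returns a (state, detail) pair where state is OK, WARNING or CRITICAL.
    """
    if packets_sent <= 0:
        raise ValueError("packets_sent must be positive")
    loss = 1 - len(rtts_us) / packets_sent
    if not rtts_us or loss > max_loss:
        return "CRITICAL", f"{loss:.0%} packet loss"
    avg = mean(rtts_us)
    if avg >= crit_us:
        return "CRITICAL", f"avg rtt {avg:.0f}us"
    if avg >= warn_us:
        return "WARNING", f"avg rtt {avg:.0f}us"
    return "OK", f"avg rtt {avg:.0f}us"
```

The point is not the arithmetic but who gets to choose it: packet count, thresholds, and units are all ordinary parameters here, so changing 'how' never means rewriting the monitoring system itself.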
When it comes to gizmos, Nagios' focus is on modularity. Single-purpose monitoring applets called 'plugins' provide support for specific devices and services. Rather than participating in the feature arms race, Nagios leaves hardware support to its community. As community members need to monitor new devices or services, they write new plugins, usually a good bit more quickly than the commercial apps add the same support. In practice, Nagios will always support everything you need it to, without your ever needing to upgrade Nagios itself. Nagios also provides the best of both worlds when it comes to support, with several commercial options, as well as a thriving and helpful community that provides free support through various forums and mailing lists.
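A plugin can be very small. By convention, a Nagios plugin prints one line of status text and reports its result through its exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. The following is a minimal sketch of that idea for disk usage; the function name, the percentage thresholds, and the wording of the status line are assumptions chosen for the example, not part of any shipped plugin.

```python
import shutil

# Exit codes that Nagios plugins use to report state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_disk(path, warn_pct, crit_pct, usage=shutil.disk_usage):
    """Return (exit_code, status_line) for the filesystem holding path."""
    try:
        total, used, _free = usage(path)
    except OSError as err:
        # The check itself could not run, which is different from a bad result.
        return UNKNOWN, f"DISK UNKNOWN - {err}"
    used_pct = 100 * used / total
    if used_pct >= crit_pct:
        code = CRITICAL
    elif used_pct >= warn_pct:
        code = WARNING
    else:
        code = OK
    return code, f"DISK {STATUS_NAMES[code]} - {used_pct:.0f}% used on {path}"
```

A real plugin would parse its thresholds from the command line, print the status line, and exit with the returned code. Because that is the entire contract, a plugin can be written in any language and can check anything you know how to measure.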
Choosing Nagios as your monitoring platform means that your monitoring effort will be limited only by your own imagination, technical prowess, and political savvy. Nagios can go anywhere you want it to, and the trip there is usually pretty simple. And while Nagios can do everything the commercial apps can and more, without the bulky, insecure agent install, it usually doesn't compare favorably to commercial monitoring systems, simply because when the spreadsheets are parsed, Nagios doesn't have as many checks. In fact, if they're counting correctly, Nagios has no checks at all, because technically it doesn't know how to monitor anything; it prefers that you tell it how. 'How', in fact, is exactly the variable that the aforementioned checkbox cannot encompass. Checkboxes cannot ask 'how', and therefore you don't want them.
What's in this book?
While Nagios is the biggest piece of the puzzle, it's only one of the myriad ...