Site Reliability Engineering: How Google Runs Production Systems 1st Edition
Jennifer Petoff (Author) Find all the books, read about the author, and more. See search results for this author |
Betsy Beyer (Author) Find all the books, read about the author, and more. See search results for this author |
Chris Jones (Author) Find all the books, read about the author, and more. See search results for this author |



Use the Amazon App to scan ISBNs and compare prices.

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large scale computing systems?
In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.
This book is divided into four sections:
- Introduction—Learn what site reliability engineering is and why it differs from conventional IT industry practices
- Principles—Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
- Practices—Understand the theory and practice of an SRE’s day to day work: building and operating large distributed computing systems
- Management—Explore Google's best practices for training, communication, and meetings that your organization can use
Customers who viewed this item also viewed
From the brand

-
Sharing the knowledge of experts
O'Reilly's mission is to change the world by sharing the knowledge of innovators. For over 40 years, we've inspired companies and individuals to do new things (and do them better) by providing the skills and understanding that are necessary for success.
Our customers are hungry to build the innovations that propel the world forward. And we help them do just that.
-
From the Publisher

This book is divided into four sections:
- Introduction—Learn what site reliability engineering is and why it differs from conventional IT industry practices
- Principles—Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
- Practices—Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems
- Management—Explore Google's best practices for training, communication, and meetings that your organization can use
How to Read This Book
This book is a series of essays written by members and alumni of Google’s Site Reliability Engineering organization. It’s much more like conference proceedings than it is like a standard book by an author or a small number of authors. Each chapter is intended to be read as a part of a coherent whole, but a good deal can be gained by reading on whatever subject particularly interests you. (If there are other articles that support or inform the text, we reference them so you can follow up accordingly.)
You don’t need to read in any particular order, though we’d suggest at least starting with Chapters 2 and 3, which describe Google’s production environment and outline how SRE approaches risk, respectively. (Risk is, in many ways, the key quality of our profession.) Reading cover-to-cover is, of course, also useful and possible; our chapters are grouped thematically, into Principles (Part II), Practices (Part III), and Management (Part IV). Each has a small introduction that highlights what the individual pieces are about, and references other articles published by Google SREs, covering specific topics in more detail. Additionally, there’s a companion website mentioned in the book that has a number of helpful resources.
We hope this will be at least as useful and interesting to you as putting it together was for us.
— The Editors.
![]() |
![]() |
|
---|---|---|
Site Reliability Engineering | The Site Reliability Workbook | |
Explore the book & companion workbook | How Google Runs Production Systems | Practical Ways to Implement SRE |
Editorial Reviews
About the Author
Niall Murphy leads the Ads Site Reliability Engineering team at Google Ireland. He has been involved in the Internet industry for about 20 years, and is currently chairperson of INEX, Ireland’s peering hub. He is the author or coauthor of a number of technical papers and/or books, including "IPv6 Network Administration" for O’Reilly, and a number of RFCs. He is currently cowriting a history of the Internet in Ireland, and is the holder of degrees in Computer Science, Mathematics, and Poetry Studies, which is surely some kind of mistake. He lives in Dublin with his wife and two sons.
^Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University.
^Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day. Based in San Francisco, he has previously been responsible for the care and feeding of Google’s advertising statistics, data warehousing, and customer support systems. In other lives, Chris has worked in academic IT, analyzed data for political campaigns, and engaged in some light BSD kernel hacking, picking up degrees in Computer Engineering, Economics, and Technology Policy along the way. He’s also a licensed professional engineer.
^Jennifer Petoff is a Program Manager for Google’s Site Reliability Engineering team and based in Dublin, Ireland. She has managed large global projects across wide-ranging domains including scientific research, engineering, human resources, and advertising operations. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester.
Don't have a Kindle? Get your Kindle here, or download a FREE Kindle Reading App.
Product details
- Publisher : O'Reilly Media; 1st edition (May 10, 2016)
- Language : English
- Paperback : 550 pages
- ISBN-10 : 149192912X
- ISBN-13 : 978-1491929124
- Item Weight : 2.05 pounds
- Dimensions : 7 x 1.12 x 9.19 inches
- Best Sellers Rank: #34,535 in Books (See Top 100 in Books)
- Customer Reviews:
About the authors
Betsy is a Technical Writer for Google in NYC specializing in Site Reliability Engineering. She has previously written documentation for Google's Data Center and Hardware Operations Teams in Mountain View and across its globally-distributed data centers. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En route to her current career, Betsy studied International Relations and English Literature, and holds degrees from Stanford and Tulane.
- Currently head of Azure SRE in Microsoft - Dublin, Ireland
- Twitter http://twitter.com/niallm
- Photos at http://www.edge-cases.photos
Jennifer Petoff is a Program Manager for Google's Site Reliability Engineering team and based in Dublin, Ireland. She has managed large global projects across wide-ranging domains including scientific research, engineering, human resources, and advertising operations. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester.
Discover more of the author’s books, see similar authors, read author blogs and more
Customer reviews
Customer Reviews, including Product Star Ratings help customers to learn more about the product and decide whether it is the right product for them.
To calculate the overall star rating and percentage breakdown by star, we don’t use a simple average. Instead, our system considers things like how recent a review is and if the reviewer bought the item on Amazon. It also analyzed reviews to verify trustworthiness.
Learn more how customers reviews work on Amazon
Reviewed in the United States on May 3, 2016
Top reviews from the United States
There was a problem filtering reviews right now. Please try again later.
I bought the Kindle version anyways because I spend enough time in front of a backlit screen that it seemed worth it to read something this large using a device that's better on your eyes. Unfortunately the Kindle version is formatted terribly and I wish I'd bought the print version instead. The book is broken up into Parts which are broken up into Chapters which are further broken up into headlined sections. The Kindle version identifies those headlined sections as chapters which is somewhat useless.
Anyways, the first few chapters aren't especially useful unless you work at Google. They mostly discuss what's unique about Google's computing infrastructure. Despite this, they were EASILY my favorite part of the book because the material is so interesting and their approach is so unique. After that, each chapter is written in a way that it can stand on its own if you aren't reading the entire book, or are reading it out of order. This is convenient for people who want to pick and choose what parts they want to read, but means that people who are reading the entire thing wind up getting a lot of the same information multiple times. It's all written by different people too, which on the one hand makes it not quite as repetitive, but on the other hand makes it hard to just skim over the sections with info you already have because you don't recognize it as information you already know until you've processed it.
Overall this is a fantastic book on DevOps, SRE, and current trends in the industry, It's a great read for anyone who wants to apply some "best practices" to their role. I would however say that reading the entire thing is overkill for most people and not necessarily the best use of your time if you have other things you'd like to be learning as well.
Part 1 - Fascinating read. I imagine this would be a good overview if you're about to start at Google and want a sneak peek at how things are done, but I'm only speculating this as an outsider.
Part 2 - Interesting and useful concepts for modern cloud computing.
Part 3 - Some useful info and a lot of stuff that's not really unique to Google in my experience. Read the parts that you think you could use some improvement on, skip the rest.
Part 4 - A condensed view from a managerial perspective of things you already read in Part 3.
Part 5 - Some case studies, comparisons from other businesses, a useless recap, and examples that could be useful to share using the website version of the book if you're trying to explain to your team what new concepts are being implemented.
It's a collection of stories and tools. Some of the stories are irrelevant to me. But others made me nod, smile, and, ultimately, learn important definitions and techniques I can apply to our own SRE practice at DigitalOnUs.
The other thing that annoyed me was that EVERY tool they talked about was written in-house and has almost no relevance in the real world. Someone needs to write a (shorter) 'real world' SRE book.
This book has a lot of great information, which I found invaluable over the years. One of the harder thing for growing organizations is to keep teams focused, and I've seen that DevOps and SRE practices help to zero in on what is essential.
A lot of Automation related work feels like 'yak shaving,' which is a term to refer to entirely unrelated things that don't add value to our product. For development teams, this feels very frustrating. Why would I want to make a script to automate this? We only use it once a year!
SRE helps to solve these frustrations, to some extent, with practices that help organizations understand why should they communicate, why should they talk about issues, and why we measure some things on some level and not others.
There are good concepts that can be picked up throughout, but enough time has passed since this book was written you may be able to save the overhead of parsing Google specifics from general concepts via some blog posts or youtube videos.
Top reviews from other countries

If you are a non-technical manager, a lot of the content is probably too technical.
If you are a fairly technical operations/support person, you'll likely come away from it thinking "yeah that's all well and good but how am I supposed to get my entire organisation into this way of thinking?"
The book tells a nice story but offers little practical advice for anyone embarking on a SRE or DevOps journey, or for anyone looking to bring it into their organisation.


This is the direction that infrastructure teams should be heading in terms of skill levels too.

