39 reviews for:
Site Reliability Engineering: How Google Runs Production Systems
Niall Richard Murphy, Chris Jones, Jennifer Petoff, Betsy Beyer
PS: This isn't really a review. It is a note for myself.
I read 5/6 chapters, but I participated in team-wide discussions of this book, so I have a sense of what more is in store. I definitely want to go back and finish it at some point.
If you are working in SRE, DevOps, or CloudOps, or looking to get into those areas, this is a good read and introduction to the concepts and approaches. It is not something you need to, or should, read straight through; focus on the chapters that are of interest or new to you. SRE/DevOps has grown since this book was written, so in a number of the chapters the custom-built applications and code don't really apply now (I think in most instances there are now off-the-shelf applications that can fill the needs of most organizations, versus having to custom-build solutions). The first of two drawbacks is that some areas approach things at a very, very large scale (very large, geographically dispersed data centers) that most organizations won't encounter themselves, so the technical focus at that scale most likely won't be relevant.
The other drawback is that the voice it's written in, even though there are a variety of authors, is overly positive in its examples and tone. Even the worst problems presented were all still handled well and perfectly.
Tone aside, I think this provides a good, broader perspective on topics that are important and germane to moving from traditional IT roles into better practice, whether you want to apply it in traditional data centers or in the cloud, and whatever terminology you prefer (SRE, DevOps, DevSecOps, CI/CD, CoreOps, etc.).
A very nice collection of essays on topics that range from "people topics" to "nontrivial tech topics" - I enjoyed the variety and the "horizon-broadening" perspective.
This is a complete collection of everything about building the SRE team, from their practices to how to onboard a new SRE to the team.
I am personally really inspired by the concept of the error budget and the share-by-default culture fostered by practices such as the blameless postmortem.
It was really great. Didn't finish it though because it is long, repetitive, disorganized, overly thorough, and not totally relevant to what I'm working on anymore.
The first thing that comes to mind about this book is how massive it is. Most of my peers who have read it have read one or two chapters, skimmed one or two more, and called it a day. I can see why. However, I found a lot of value in each part of the book. Look, Google runs a LOT of stuff. And they run it really well.
I've worked in Ops teams, Dev teams and DevOps teams, and this book gives you plenty to think about wherever you fall in that spectrum, whether you work in an organization where SRE is being considered or not.
My favorite quote from the book, attributed to Joseph Bironas:
If we are engineering processes and solutions that are not automatable, we continue having to staff humans to maintain the system. If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings. Think The Matrix with less special effects and more pissed off System Administrators.
This book details a lot of modern-day infra-related concepts and the rationale behind them.
Examples include the discussion around Service-Level Objectives (SLOs).
- If an SLO is 99.99%, that leaves an error budget of 0.01% (about 52 minutes a year). What surprised me is that product teams push back against raising it further without serious justification: being allowed only 52 minutes of downtime a year hurts engineering velocity and prevents the team from embracing risk and moving fast.
- The team should aim for higher than 99.99% anyway, but should not be penalised unless breached.
- An SLO should never be 100%, as too many real-world factors are outside your control, such as ISP outages.
- How do you measure an SLO? A simple measure is (successful requests)/(total requests) as a percentage (see the sketch after this list). Ideally you measure it client-side rather than server-side, since clients are the ones whose UX is actually affected.
- Another idea is not to be "too available" or teams can become dependent on it, but I guess this depends on the services provided.
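To make the arithmetic above concrete, here is a minimal Python sketch of the error-budget calculation and the simple request-based SLI; the function names and request counts are my own illustration, not from the book:

```python
# Minimal sketch of the error-budget arithmetic and the simple SLI above.
# All names and numbers are illustrative, not from the book.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def error_budget_minutes(slo: float) -> float:
    """Yearly downtime allowed by an availability SLO such as 0.9999."""
    return (1 - slo) * MINUTES_PER_YEAR

def availability(successful: int, total: int) -> float:
    """The simple SLI: successful requests / total requests."""
    return successful / total

print(f"99.99% SLO -> {error_budget_minutes(0.9999):.1f} min/year")  # ~52.6
print(f"measured SLI: {availability(999_912, 1_000_000):.4%}")       # 99.9912%
```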
Toil
- Hands-on time running a script is toil. If it can be automated away but hasn't been, then it is toil.
- Overhead is not the same as toil. Overhead is things like meetings, code reviews, etc.
- SREs @ Google are expected to spend less than 50% of their time on toil (at most ~20 hours a week); most teams are at about 33% (see the sketch after this list).
- Engineers are expected to complain loudly if there is too much toil, as unaddressed toil drives attrition of the best engineers: not enough time goes to new projects, causing stagnation and low morale. And if someone is content with toil, others will hand that work to them too.
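A rough sketch of how that toil budget could be checked; the categories, hours, and the idea of logging work this way are assumptions of mine, not Google's actual tooling:

```python
# Illustrative toil-budget check. Categories and hours are assumed,
# not Google's actual tracking tooling.

WEEKLY_HOURS = 40
TOIL_CAP = 0.50  # the <50% toil budget described above

# Hours logged this week; "overhead" (meetings, code reviews) is not toil.
week = {"toil": 14, "project_work": 20, "overhead": 6}

toil_fraction = week["toil"] / WEEKLY_HOURS
print(f"toil: {toil_fraction:.0%} of the week (cap: {TOIL_CAP:.0%})")
if toil_fraction >= TOIL_CAP:
    print("Over the toil budget -- time to complain loudly.")
```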
Misc ideas
- Testing is used for known data, whereas monitoring is used for unknown or unpredictable data.
- Responses are measured differently depending on context: throughput matters when sending videos (guarantee the media is delivered even if it takes longer), whereas latency matters when searching for videos (faster results for users). See the sketch below.
- Paxos variations and how they are used at large scale.
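A small illustration of the throughput-vs-latency distinction, computing two different indicators from the same made-up request log (all names and numbers are hypothetical):

```python
# Illustrative only: one request log, two different SLIs depending on
# what users care about. Data is made up.

durations_s = [0.8, 1.2, 0.9, 30.0, 1.1]   # wall time per request (seconds)
bytes_moved = [2e6, 3e6, 2e6, 9e8, 2e6]    # payload per request (bytes)

# Search-like service: users feel latency, so report the slow tail.
print(f"worst-case latency: {max(durations_s):.1f}s")

# Video-transfer-like service: completion matters more than speed,
# so report aggregate throughput instead.
throughput_mb_s = sum(bytes_moved) / sum(durations_s) / 1e6
print(f"throughput: {throughput_mb_s:.1f} MB/s")
```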
informative
slow-paced
It's a lot of information! I'll probably have to listen to it again.
medium-paced
The Bible.
This book is so good--basically creating a field and defining a set of practices that needed a lot of definition.
Helpful for everyone regardless of where they are in their path of ensuring prod is up--it covers onboarding, on-call, and postmortems with such honest clarity it is disarming.
The book is not all written with equal strength and clarity, but it is all worth reading, and the good stuff is honestly life-changing.
Strongly recommended to anyone that uses a computer.
Gold.