"How to start with Site Reliability Engineering without breaking the bank?"
Agile and DevOps are more or less mainstream today. Although many organizations are just at the beginning of their Agile and/or DevOps journey, the basic ideas and goals of both are well understood and well accepted by most organizations. A newer acronym in the “Cloud, Agile and DevOps” space that has gained significant attention over the past six months (at least here in Sweden) is SRE – Site Reliability Engineering.
Site Reliability Engineering is Google’s approach to Service Management. Grown from how Google runs google.com, it is a new approach to running, operating and managing IT systems. The old ways of “sys admins”, dedicated operations teams and, in most cases, heavy ITIL-based processes are challenged, while new concepts like Service Level Indicators (SLIs), “blameless postmortems” and Error budgets are introduced. The merge of development and operations that DevOps in many ways tries to address is fully embraced, but with some small twists and added scope here and there. Google likes to describe SRE as:
“SRE is what happens when you ask a software engineer to design an operations team”
SRE, the not so new kid on the block
The term SRE was first introduced back in 2003. A Googler named Benjamin Treynor, the originator of the term, was managing a production team consisting of seven engineers. The goal of this team was, as in many other cases, to make sure that the Google websites were available and reliable. The classic split of responsibilities between the product development team and the operations team, and the conflicting goals of the two (one launches new features, the other makes sure that the service doesn’t break), was evident and made it difficult to achieve the overall business goals.
As a result, SRE was born and became the new paradigm for how Google runs large-scale systems as well as facilitates the continuous introduction of new features.
So, What Is SRE?
It is not my intention in this blog post to describe in detail all the aspects and components of SRE. There are several great books from Google and countless Internet resources and YouTube videos that already do this very well. I will, however, try to summarize some of the key SRE principles in the following bullets.
- Embrace risk – 100% reliability is most likely the wrong target. In most cases it doesn’t make sense to spend the kind of effort (and money) that 100% reliability requires when consumers in many cases are dependent on a chain of different service providers like ISPs and Telcos to use your service.
- Identify, monitor, and manage key aspects of your services. Start with identifying your key indicators (SLIs) to track if your “end-users” are happy or not and set goals/objectives for these, your Service Level Objectives (SLOs). For example, the request latency SLI is a common one to track since it usually has a direct connection to “customer happiness”. Based on this you can then set an SLO like this:
– 90% of requests within the last 28 days shall be served to the customer in less than 450 ms
- Use Error budgets. An error budget is 1 minus the SLO of the service. This means that a service with a 99.9% SLO has a 0.1% error budget. Converted to time, this means that ~43 minutes of unplanned downtime per month (0.1% of a 30-day month) is accepted by all stakeholders. As long as you have not spent this “budget” you can continue to make changes like new deployments, etc.
- Eliminate toil. Any manual, repetitive work that needs to be done as part of “normal operations” is considered “toil” and should be removed/reduced over time. A common characteristic of toil is that it grows at least as fast as its source. For example, the time you spend on restarting servers will grow as the number of servers grows. Reducing the time/effort spent on toil (usually by various forms of automation) is a vital part of SRE.
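To make the SLO and Error budget bullets above a bit more concrete, here is a minimal Python sketch of the arithmetic. It is only an illustration: the names (LatencySLO, latency_sli, error_budget_minutes and so on) and the sample latencies are made up for this post and do not come from Google or any monitoring product, and a 30-day month is assumed for the time conversion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical container for an objective like '90% of requests under 450 ms'."""
    target_fraction: float  # e.g. 0.90
    threshold_ms: float     # e.g. 450.0


def latency_sli(latencies_ms, threshold_ms):
    """The SLI: fraction of requests served faster than the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms, slo):
    """True if the measured SLI is at or above the SLO target."""
    return latency_sli(latencies_ms, slo.threshold_ms) >= slo.target_fraction


def error_budget_minutes(slo_fraction, window_days):
    """Error budget (1 - SLO) converted to minutes of downtime in the window."""
    return (1.0 - slo_fraction) * window_days * 24 * 60


def budget_remaining_minutes(slo_fraction, window_days, downtime_minutes):
    """Budget left after subtracting the downtime already spent."""
    return error_budget_minutes(slo_fraction, window_days) - downtime_minutes


def can_deploy(slo_fraction, window_days, downtime_minutes):
    """Simplified policy: keep shipping changes while any budget is left."""
    return budget_remaining_minutes(slo_fraction, window_days, downtime_minutes) > 0


# Made-up sample data: ten request latencies in milliseconds.
sample_latencies_ms = [120, 200, 310, 95, 430, 600, 180, 250, 90, 410]
slo = LatencySLO(target_fraction=0.90, threshold_ms=450.0)
sli = latency_sli(sample_latencies_ms, slo.threshold_ms)

print(f"latency SLI: {sli:.0%}, SLO met: {meets_slo(sample_latencies_ms, slo)}")
print(f"monthly budget at 99.9%: {error_budget_minutes(0.999, 30):.1f} minutes")
print(f"deploys allowed after 20 min downtime: {can_deploy(0.999, 30, 20)}")
```

In a real setup the latencies would come from your monitoring tool over the full 28-day window rather than from a hard-coded list, and the "can we deploy" decision is usually a conversation between stakeholders rather than a single boolean.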
How to get started with SRE?
There is no widespread consensus on exactly how to organize SRE team(s) or exactly what the site reliability engineer’s role entails. Google has outlined a number of ways to organize and implement SRE teams, but my advice is to start simple by having your product teams adopt some of the core concepts of SRE and then grow it from there. Once the core “lingo” and concepts of SRE are understood, adopted, and accepted by product owners, Cloud engineers and business stakeholders, the way forward should be much easier. So where do you start?
The art of SLOs workshop, a great place to start
To help organizations start defining and working with SLIs, SLOs and Error budgets, Google has published the content that their own Customer Reliability Engineers (CREs) used for a workshop called “Art of SLOs”. The Art of SLOs teaches the essential elements of developing SLOs to an audience from across development, operations, product, and business stakeholders. All the materials for the workshop, including handbooks for both participants and facilitators, are freely available to use. The main goal of the workshop is to introduce participants to the way Google measures service reliability—in terms of Service Level Indicators (SLIs) and Service Level Objectives (SLOs)—and give them some hands-on experience creating these measures in practice.
Toil hunting and reduction
As stated before, toil is work attached to running a production service. It tends to be manual and repetitive, and it usually scales linearly as a service grows. Identifying and reducing toil early is a key success factor for all SRE initiatives and should be prioritized.
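One simple way to start the hunt is to write down your recurring manual tasks and estimate how much time they cost per month. The Python sketch below shows the idea under some loud assumptions: the task names and numbers are invented, ToilTask and the helper functions are hypothetical, and toil is modelled as growing strictly linearly with the number of servers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToilTask:
    """Hypothetical record of one recurring manual task."""
    name: str
    minutes_per_occurrence: float
    occurrences_per_server_per_month: float


def monthly_toil_hours(task, server_count):
    """Estimated hours per month, assuming toil grows linearly with servers."""
    minutes = (task.minutes_per_occurrence
               * task.occurrences_per_server_per_month
               * server_count)
    return minutes / 60


def rank_by_toil(tasks, server_count):
    """Tasks sorted from most to least time-consuming: automation candidates first."""
    return sorted(tasks,
                  key=lambda t: monthly_toil_hours(t, server_count),
                  reverse=True)


# Invented example tasks and numbers.
tasks = [
    ToilTask("rotate certificates by hand", 30, 0.5),
    ToilTask("restart servers", 10, 2),
    ToilTask("clean up disk space", 5, 1),
]

for servers in (50, 100):
    print(f"--- {servers} servers ---")
    for task in rank_by_toil(tasks, servers):
        print(f"{task.name}: {monthly_toil_hours(task, servers):.1f} h/month")
```

Doubling the number of servers doubles every estimate, which is exactly why toil should be reduced before the service grows further. The ranking is only a starting point for discussion; how easy a task is to automate matters as much as how many hours it costs.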
Quick summary of how to start
To summarize, my suggestion for starting your SRE journey without “breaking the bank” is to take the following steps:
- Run the “Art of SLOs” workshop with your teams. This will give you a kick start in learning and understanding the key concepts of SLIs, SLOs and Error budgets. You and your team can then apply that knowledge to your own services and applications.
- Identify your own SLIs and SLOs and start to track/monitor these in your monitoring tool of choice. Many modern monitoring platforms (like Stackdriver) have built-in support for this.
- Start to eliminate toil in a controlled and proactive way. For example, appoint one person in each team and allocate 20-30% of their time to “toil identification and reduction”. Usually these tasks start around CI/CD or the automation of time-consuming tasks such as unit and integration testing. Another common area is Infrastructure-as-Code.
Of course, I am not saying that the steps listed above are enough for a full-blown SRE implementation/adoption, but they are a good first step for organizations to take and subsequently build on. The goal here is to start small with a rather moderate level of ambition, evaluate whether this is “right for you” and then accelerate.