Site Reliability Engineering – An Overview

Site reliability engineering (SRE) has gained considerable traction in the IT industry in recent years. It suits cloud-native software delivery, where operations must keep pace with rapid releases. The SRE role is also spreading beyond IT into several other industries. Today SRE is a full-fledged discipline that applies automation to IT operations such as performance planning, on-call monitoring, disaster response, and capacity planning.

SRE entered the tech lexicon through Google’s VP Benjamin Treynor Sloss and his team. The primary goal of site reliability engineering is to empower software developers to take ownership of their applications in production. This lets developers move as fast as possible while reducing the chance that something blows up in production for the operations team.

Put another way, SRE combines infrastructure automation with continuous delivery. It is best suited to cloud-native and SaaS companies. It also shifts many IT Ops responsibilities to the development team. Although it sounds similar to DevOps, SRE differs from it in ways we will discuss later.

What is SRE?

Google’s definition of site reliability engineering describes engineers who “automate their way out of a job.” It applies software engineering methods to system administration work, and so acts as a bridge between development and IT operations. At Google it is a strict process in which engineers split their time between development activities and operational or on-call activities. Google caps operational work at 50% of that time. Beyond this limit, the system is considered unhealthy.
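As a rough illustration of the 50% limit, the sketch below computes the share of logged time that went to operational work and flags when it crosses the cap. The WeekLog record, its field names, and the sample hours are hypothetical; real teams would pull these figures from their own ticketing or time-tracking data.

```python
from dataclasses import dataclass

OPS_CAP = 0.50  # the 50% ceiling on operational work described above


@dataclass
class WeekLog:
    """Hours a team logged in one week (hypothetical record for illustration)."""
    ops_hours: float          # on-call, tickets, manual operational work
    engineering_hours: float  # automation and development work


def ops_share(log: WeekLog) -> float:
    """Return the fraction of logged time spent on operational work."""
    total = log.ops_hours + log.engineering_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return log.ops_hours / total


def over_cap(log: WeekLog, cap: float = OPS_CAP) -> bool:
    """True when operational work exceeds the cap."""
    return ops_share(log) > cap


if __name__ == "__main__":
    week = WeekLog(ops_hours=26.0, engineering_hours=14.0)  # made-up numbers
    print(f"ops share: {ops_share(week):.0%}, over cap: {over_cap(week)}")
```

A week that lands exactly on 50% is treated here as within the limit; whether the boundary counts as a breach is a policy choice, not something the source specifies.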

Site reliability engineering is also an approach aimed mainly at large-scale, cloud-native systems and their IT operations. Service level objectives (SLOs), which usually form part of a service level agreement (SLA), give the SRE team and the development team a shared basis for productive interaction.

The error budget is another vital concept. It balances the pace of change in an application against its reliability. According to Google, the business must first establish an availability target for the system. Once that target is set, one minus the availability target is the error budget. For example, an availability target of 99.99% allows 0.01% unavailability. This 0.01% is a budget that the development team can spend on anything it wants, as long as it does not overspend.

SRE is also a collaborative approach that assures product developers that the designed solution meets its non-functional requirements. These may include –

  • Availability
  • Performance
  • Security
  • Maintainability
  • Release management, such as cross-checking the efficiency of the software delivery pipeline
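To make the error budget arithmetic described earlier concrete, here is a minimal sketch that turns an availability target into a budget and then into allowed downtime. The function names and the 30-day measurement window are assumptions made for illustration; the source only gives the "one minus the availability target" rule.

```python
from datetime import timedelta


def error_budget_fraction(availability_target: float) -> float:
    """One minus the availability target, e.g. 0.9999 -> roughly 0.0001."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability target must be in (0, 1]")
    return 1.0 - availability_target


def allowed_downtime(availability_target: float, window: timedelta) -> timedelta:
    """Downtime the budget permits over the given window.

    The window length is an assumption; the source does not name one.
    """
    return window * error_budget_fraction(availability_target)


if __name__ == "__main__":
    budget = allowed_downtime(0.9999, timedelta(days=30))
    print(f"99.99% over 30 days allows about {budget.total_seconds() / 60:.2f} minutes")
```

Under these assumptions, a 99.99% target over 30 days leaves a little over four minutes of downtime to spend.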

Typical responsibilities which come under site reliability engineering are:

  • Proactively monitoring application performance
  • Reviewing application performance
  • Handling on-call and emergency support
  • Checking logging and diagnostics
  • Creating and maintaining operational runbooks
  • Helping with escalated support tickets
  • Working on support tickets to resolve defects and other development tasks

How does SRE work?

Google’s site reliability engineering team describes it in the following way –

“SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor. We have a bunch of rules of engagement, and principles for how SRE teams interact with their environment — not only the production environment, but also the development teams, the testing teams, the users, and so on. Those rules and work practices help us to keep doing primarily engineering work and not operations work.”

In other words, the activities of the SRE team are very similar to those of an operations team, and involve –

  • Verifying system availability
  • Checking performance
  • Latency measurement
  • System monitoring
  • Measuring the efficiency
  • Emergency response
  • Change management
  • Capacity planning

Site reliability engineering automates the above IT operations using programming languages, algorithms, and data structures, which are the core expertise areas of software engineering.
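As a small, hedged example of what automating two of these activities might look like, the sketch below computes availability and a latency percentile from a batch of request records. The Request record and the choice of the nearest-rank percentile method are illustrative assumptions; production systems would normally read such figures from a monitoring stack rather than from an in-memory list.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """One served request (hypothetical record for illustration)."""
    ok: bool           # True if the request succeeded
    latency_ms: float  # time taken to serve it


def availability(requests: List[Request]) -> float:
    """Fraction of requests that succeeded."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(1 for r in requests if r.ok) / len(requests)


def latency_percentile(requests: List[Request], pct: int) -> float:
    """Latency at the given percentile, using the nearest-rank method."""
    if not requests:
        raise ValueError("no requests to measure")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up sample: ten requests, one of which failed.
    sample = [Request(ok=(i != 3), latency_ms=10.0 * (i + 1)) for i in range(10)]
    print("availability:", availability(sample))
    print("p90 latency (ms):", latency_percentile(sample, 90))
```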

Moreover, the SRE model works in a balanced way, maintaining several metrics and team dynamics. In a typical SRE model –

  • Product development teams launch and run their own services, including on-call duty for incidents.
  • When the service matures or reaches a high-traffic state, the development team asks the SRE team for support. The SRE team then takes over running the service in production.
  • The product owner defines a service-level objective (SLO) based on the acceptable downtime.
  • The acceptable downtime is the error budget for the service. The development team spends this error budget for various purposes, for example to try new features or to improve operability. However, if the service’s downtime exceeds the budget, the development team can’t make any further changes.

This creates a very powerful dynamic in the overall process. It not only addresses operational problems rapidly but also keeps product owners honest about the required SLO.
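The error budget rule in the list above can be sketched as a small tracker that records downtime and reports whether further changes are allowed. This is a minimal sketch under stated assumptions: the ErrorBudget class, its method names, and the 30-day window are hypothetical, and real error budget policies usually carry more nuance than a single yes/no check.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ErrorBudget:
    """Tracks downtime against an SLO (hypothetical helper for illustration)."""
    slo: float                          # availability target, e.g. 0.999
    window: timedelta                   # period the SLO is measured over (assumed)
    downtime: timedelta = timedelta(0)  # downtime recorded so far in the window

    @property
    def budget(self) -> timedelta:
        """Total downtime the SLO permits over the window."""
        return self.window * (1.0 - self.slo)

    @property
    def remaining(self) -> timedelta:
        """Budget left; negative once the budget is overspent."""
        return self.budget - self.downtime

    def record_outage(self, duration: timedelta) -> None:
        """Add an outage to the downtime recorded so far."""
        self.downtime += duration

    def changes_allowed(self) -> bool:
        """Changes are allowed only while some budget remains."""
        return self.remaining > timedelta(0)


if __name__ == "__main__":
    service = ErrorBudget(slo=0.999, window=timedelta(days=30))
    service.record_outage(timedelta(minutes=30))
    print("after 30 min outage, changes allowed:", service.changes_allowed())
    service.record_outage(timedelta(minutes=20))
    print("after 50 min total, changes allowed:", service.changes_allowed())
```

With a 99.9% SLO over 30 days the budget is roughly 43 minutes, so the second outage in this made-up example pushes the service over budget and the check turns false.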

SRE vs. DevOps

In any discussion of site reliability engineering, SRE vs. DevOps is an unavoidable topic. Because the two share many operational similarities, many claim that SRE is a replacement for DevOps. However, there are clear differences between the two approaches.

  • Traditionally, DevOps is more about collaboration between developers and operations, and it has focused more on deployments. Site reliability engineering, by contrast, focuses more on operations and monitoring.
  • Where DevOps and SRE go hand in hand, DevOps helps with the configuration, deployment, racking, etc. of servers and applications. Site reliability engineers then handle daily operations once the setup is done.

It is not unusual for a company to use both DevOps and SRE, especially if it is not using the cloud.

What are the tools used for SRE?

There is no single, specific SRE toolset. Any organization looking to build out an SRE function should define its tools itself. Both processes and tools are vital, along with standardization and automation, for scalability, repeatability, and other reasons.

Skills required for SRE

SRE is not a line item, and finding the right skill set for it is a challenge in the market. It is a high-skill activity, and SRE experts are in short supply. It calls for an unusual mix of talent: in-depth technical knowledge combined with customer-focused attention to SLOs and error budgets. One important point to remember is that IT operations should be treated as a value center rather than as a target for cost reduction. Seen that way, IT operations adds value to a company by maximizing revenue and avoiding downtime.

Thus, SRE demands the following of a professional –

  • A high overall level of professional skill
  • Experience
  • Commitment
  • Automation skills
  • Skills and experience in both software and systems engineering

SRE as a service

SRE comes with many benefits, but it demands a specialized and expensive skill set. Although Google runs SRE with an in-house team, SRE-as-a-service is an emerging option for many large organizations, and some capable managed service providers already offer it. The SRE-as-a-service model is somewhat unusual alongside in-house DevOps approaches. However, given SRE’s standard operating procedures, such as the SLO, it can be a cost-effective solution. The SLO and the other standard operating procedures at the heart of the SRE approach translate well into a commercial contract. These contracts are quite different from typical outsourced IT operations contracts, and a managed SRE contract includes clear terms.

How does managed SRE help in the process? An SRE provider helps the development team improve operations before a production release through a time-and-materials arrangement. Managed SRE providers also sometimes use tools to automate the standard IT Ops tasks needed to run the software in production.

Final thought

SRE is a great way to avoid spending a lot of time chasing bugs. Today the role of a software developer is not limited to coding. Instead, developers actively take part in software deployments, application monitoring, and production operations. One of the primary reasons is the availability of tools that make it extremely easy to deploy applications and monitor them. IT operations teams not only exist in most medium to large enterprises, but the nature of their work is also changing continuously as they adopt technologies such as cloud, containers, and PaaS. Involving SREs in a new release can therefore help the team understand the changes in that release. It can also expedite troubleshooting of the problems associated with the release.