Unlocking cloud value: Achieving operational excellence through SRE

Many organizations use public cloud technology to reduce costs and improve business agility, innovation, and resilience. Gen AI is adding even more value to the estimated $3 trillion in EBITDA value from cloud by 2030.1

However, many organizations are still working to fully realize the benefits of cloud transformations. In many cases, simply transferring existing models (such as waterfall and ticket-based plan-build-run infrastructure) to the cloud can result in limited value creation or even value destruction (for example, by relying on highly manual processes). As a result, most cloud leaders have learned that modernizing assets on the cloud can enable greater value and benefits than simply “lifting and shifting” assets as-is to the cloud.

To adapt to the cloud, most leading organizations adopt a best-in-class product and platform model to establish a modernized operating model for IT infrastructure.2 These models typically involve two key parts. First, they include platform engineering for infrastructure, with services managed as products and delivered through self-service APIs and as software through code pipelines. Second, they use site reliability engineering (SRE),3 which uses software engineering practices and automation to manage application and infrastructure operations more effectively. (See sidebar, “Key types of development approaches.”)

In this article, we explore how to scale SRE in application operations as part of the product and platform operating model. If done successfully, this can help companies achieve greater benefits from their cloud applications and improve their delivery speed, reliability, and efficiency. In our experience, leading enterprises can achieve 60 to 70 percent of their desired financial goals (depending on their current level of maturity and adoption) by integrating SRE best practices into their technology migration to transform the operating model on cloud.

Making the most of cloud migrations with SRE

As infrastructure moves to the cloud, organizations have an opportunity to rethink processes and team structures. SRE practices emphasize operational excellence, system reliability, and automation across all layers of IT application, operations, and infrastructure. These practices can lead to significant improvements in operational productivity, tooling, platform synergies, speed, resilience, quality, security, and user experience (Exhibit 1).

Image description: A series of icons shows the financial and nonfinancial benefits of SRE operating models. These models can improve operational productivity by 20 to 30 percent or more; DevOps tools and platform synergies by 30 to 40 percent; speed by 50 percent or more; resilience by 30 to 50 percent or more; experience and productivity by 30 percent; quality by 30 to 50 percent or more; and security by 50 percent or more. Source: McKinsey analysis End of image description.

 

Many organizations that adopt SRE, however, do not realize its full potential, because they adopt only part of the model. Common failure modes include the following:

  • assigning traditional operational support staff to become SRE experts without the right skills or without providing the necessary automation
  • embedding SRE experts into application teams without defining a clear operating model and responsibilities, resulting in “finger-pointing,” simply handing issues to SRE experts, and expensive operational teams
  • keeping SRE teams entirely separate from application teams and not providing them with the authority to push back on software that does not meet the organization’s standards for quality, operations, and resilience
  • focusing SREs primarily on reactive manual activities and not prioritizing automation and engineering to reduce demand and operational toil

Key steps for successful SRE implementation

Leading organizations successfully adopt the SRE model when they take a holistic approach to creating integrated SRE teams, modernizing IT service management (ITSM) processes, using platform engineering to increase automation, supporting continuous change with specialized talent, and using holistic (rather than ticket-based) metrics to evaluate outcomes (Exhibit 2). Below, we outline the journey of implementing SRE successfully, from choosing the SRE operating model to how to manage toward final outcomes.

Image description: A framework chart shows the five elements needed for holistic SRE models on cloud. The first element is an integrated SRE model alongside the cloud migration. This brings together application and infrastructure support into the SRE function. The second element is modernization of ITSM—or operational—processes for cloud. The third element is platform engineering to increase automation and eliminate toil. This involves establishing a unified set of platform engineering tools. The fourth element is ample software engineering talent for operations, which requires driving continuous learning and upskilling. The fifth element is governance using data-driven metrics, including a holistic set of leading and lagging metrics. Source: McKinsey analysis End of image description.

1. Choosing an SRE operating model

Implementing SRE starts with designing an integrated operating model that brings together application, operations, and infrastructure functions. This involves close collaboration with engineering and architecture leaders to align the SRE model with broader strategies and business objectives.

There are three levels of SRE deployments (Exhibit 3):

  • Concierge: SRE teams provide operational support and automation as a service, without making independent changes.
  • Embedded: SRE teams gain more autonomy and flexibility, making code changes directly to the cloud platform.
  • Partners: Mature SRE teams build proprietary code and new cloud platform automations, sharing them for reuse across the organization.

Image description: A table outlines three types of SRE models: shared operations, aligned SRE and security, and integrated DevSecOps. It shows how they compare regarding their goals, structure, SRE-to-developer ratio, and success factors. Shared operations involves integrating SRE professionals into product teams from a common pool of SRE talent. These professionals are accountable to the product teams. Aligned SRE and security involves integrating SRE talent into teams on a product group or family level, with SRE professionals serving similar product teams. Integrated DevSecOps involves deploying SRE tools over the cloud for product teams to use, with more of a “you build you run” model with just a few shared SRE security teams. Each successive model requires fewer SRE professionals per developer. For example, in shared operations, one SRE professional is needed for every 5 developers. For aligned SRE and security, one SRE professional is needed for every 10 developers. And for integrated DevSecOps, one SRE professional is needed for every 20 or more developers. Source: McKinsey analysis End of image description.
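
To make the staffing ratios in the exhibit concrete, the following minimal sketch estimates how many SRE professionals a given number of developers would call for under each model. The function name, the dictionary, and the 200-developer example are illustrative assumptions, not part of the McKinsey analysis. The "20 or more" ratio is treated as exactly 20, so the third figure is an upper bound.

```python
import math

# Developers supported by one SRE professional in each model, as described
# in the exhibit. The last model is "20 or more"; 20 is used here as an
# assumption, which makes its headcount an upper bound.
DEVELOPERS_PER_SRE = {
    "shared operations": 5,
    "aligned SRE and security": 10,
    "integrated DevSecOps": 20,
}


def sre_headcount(developers: int, model: str) -> int:
    """Estimate SRE professionals needed, rounding up to whole people."""
    if developers < 0:
        raise ValueError("developers must be non-negative")
    return math.ceil(developers / DEVELOPERS_PER_SRE[model])


if __name__ == "__main__":
    # Hypothetical organization with 200 developers.
    for model in DEVELOPERS_PER_SRE:
        print(f"{model}: {sre_headcount(200, model)} SRE professionals")
```

For the hypothetical 200-developer organization, the estimate falls from 40 SRE professionals under shared operations to 20 and then to 10 as the model matures.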

2. Modernizing ITSM operational processes for cloud

Organizations that simply move legacy processes to the cloud can miss out on potential benefits and retain existing operational friction. SRE involves reimagining ITSM processes to improve the delivery and outcomes of IT services. This is especially important during hybrid cloud transitions, in which divided processes can create more work for cloud-based teams.

Leaders focus on reimagining processes across four categories: design and provisioning, operational management, release and transition, and service management. Transforming these processes involves incorporating more automation and streamlined workflows. With incident management, for example, rather than relying on manual detection and resolution of platform issues, teams build automated systems and streamlined processes that proactively detect, triage, and address issues for application teams.
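
The detect, triage, and address flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any particular tool: the thresholds, the `HealthSignal` fields, and the two runbook functions are all hypothetical, and a real system would call monitoring and orchestration services rather than return strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class HealthSignal:
    service: str
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th-percentile response time


# Hypothetical thresholds; real values would be agreed with application teams.
ERROR_RATE_LIMIT = 0.05
LATENCY_LIMIT_MS = 500.0


def detect(signal: HealthSignal) -> Optional[str]:
    """Detect: classify the signal as an issue type, or None if healthy."""
    if signal.error_rate > ERROR_RATE_LIMIT:
        return "high_error_rate"
    if signal.p99_latency_ms > LATENCY_LIMIT_MS:
        return "high_latency"
    return None


# Hypothetical automated remediations, standing in for real runbooks.
def restart_service(service: str) -> str:
    return f"restarted {service}"


def scale_out(service: str) -> str:
    return f"scaled out {service}"


# Triage: map each known issue type to its automated remediation.
RUNBOOKS: Dict[str, Callable[[str], str]] = {
    "high_error_rate": restart_service,
    "high_latency": scale_out,
}


def handle(signal: HealthSignal) -> str:
    """Address: run the matching runbook, or escalate if none exists."""
    issue = detect(signal)
    if issue is None:
        return "healthy"
    runbook = RUNBOOKS.get(issue)
    if runbook is None:
        return f"escalated {issue} to on-call"
    return runbook(signal.service)
```

The point of the design is that a person is involved only when no runbook matches, which is where manual effort shifts from resolution to engineering new automations.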

For example, imagine a top financial-services company that initially “lifted and shifted” its old IT processes directly into its new cloud systems. The company realized it could improve these systems with significantly streamlined processes that incorporate end-to-end automation. By adopting a DevSecOps approach, the company could redesign its IT processes on the cloud using an automated CI/CD pipeline. Tickets would be automatically generated and updated throughout the pipeline, eliminating the need to manually create them. The new system would also provide full tracking of changes and the ability to undo changes, giving teams better control and making audits easier.
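
A minimal sketch of that idea follows: a pipeline that opens a change ticket on its own, records each stage in the ticket's history, and can undo the latest deployment. The stage names, class names, and version labels are assumptions for illustration, and every stage is assumed to pass. A real implementation would integrate with the organization's CI/CD and ITSM tools.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeTicket:
    change_id: int
    description: str
    history: List[str] = field(default_factory=list)  # audit trail


class Pipeline:
    # Hypothetical stage names for an automated CI/CD pipeline.
    STAGES = ("build", "test", "security_scan", "deploy")

    def __init__(self) -> None:
        self.tickets: List[ChangeTicket] = []
        self.deployed: List[str] = ["v0"]  # stack of versions; "v0" is the baseline

    def _open_ticket(self, description: str) -> ChangeTicket:
        # The ticket is created by the pipeline, not by a person.
        ticket = ChangeTicket(len(self.tickets) + 1, description)
        self.tickets.append(ticket)
        return ticket

    def run(self, version: str, description: str) -> ChangeTicket:
        """Run all stages, updating the ticket after each one."""
        ticket = self._open_ticket(description)
        for stage in self.STAGES:
            # In this sketch every stage is assumed to pass.
            ticket.history.append(f"{stage}: passed")
        self.deployed.append(version)
        return ticket

    def rollback(self) -> str:
        """Undo the latest deployment and record the undo as its own change."""
        if len(self.deployed) > 1:
            undone = self.deployed.pop()
            ticket = self._open_ticket(f"rollback of {undone}")
            ticket.history.append("rollback: completed")
        return self.deployed[-1]
```

Because both deployments and rollbacks leave a ticket with a stage-by-stage history, the list of tickets doubles as the audit record.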

3. Maturing toward platform engineering to eliminate toil

Cloud teams often deliver composite cloud services to improve automation, but significant work remains for developers to configure, integrate, and consume APIs for their applications (20 to 30 percent or more of a full cloud foundation). Full SRE transformations involve platform engineering to create complete end-to-end self-service with fully automated systems, which reduce operational toil and improve the developer experience. This includes developer platforms, observability tools, and CI/CD pipelines with built-in testing, security, and compliance functions.4

4. Shifting to software and cloud engineering talent for operations

Transitioning traditional infrastructure and operations groups to the SRE model can be challenging. Employees need to learn how to use cloud computing services, infrastructure as code, and continuous-delivery pipelines. Leading organizations typically need to implement a range of talent strategies to build new talent and skills within the organization.5 They often start by providing learn-by-doing SRE apprenticeship programs, with boot camps to train teams on the new operating model and tools and coaching for more complex skills. A proven model for scaling capability building is a train-the-trainer approach, in which SRE champions are chosen from each team to both learn SRE skills and coach the rest of their teams. After an SRE boot camp, leading organizations encourage continuous learning with metrics that track team development.

5. Managing toward outcomes rather than ticket-based SLAs

Traditional service-level agreements (SLAs) manage risk and offer predictability but can still allow fluctuations in service availability. In an SRE environment, SLAs are supplemented with leading and lagging metrics. For example, an SLA might include a service-level objective (SLO) that 99 percent of API requests are served within 100 milliseconds, measured by a service-level indicator (SLI) based on actual API response times, along with leading indicators such as error rates and latency. By integrating these metrics, SRE-style SLAs create transparency and balance speed, quality, and resilience. Better measurement systems enable new possibilities, such as embedding SRE-related outcomes into teams’ measurable goals, offering real-time reporting of metrics to identify areas that need intervention, and reflecting progress in teams’ quarterly business reviews.
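
As a simple illustration of how the SLO in this example could be evaluated, the sketch below computes the SLI from a batch of observed response times and compares it with the 99 percent target. The function names and sample data are assumptions for illustration. In practice the measurements would come from an observability platform, typically over a rolling time window.

```python
from typing import Sequence

SLO_TARGET = 0.99             # 99 percent of requests...
LATENCY_THRESHOLD_MS = 100.0  # ...served within 100 milliseconds


def latency_sli(response_times_ms: Sequence[float]) -> float:
    """SLI: fraction of requests served within the latency threshold."""
    if not response_times_ms:
        raise ValueError("no requests observed")
    fast = sum(1 for t in response_times_ms if t <= LATENCY_THRESHOLD_MS)
    return fast / len(response_times_ms)


def meets_slo(response_times_ms: Sequence[float]) -> bool:
    """Compare the measured SLI with the SLO target."""
    return latency_sli(response_times_ms) >= SLO_TARGET


if __name__ == "__main__":
    # Hypothetical sample: 995 fast requests and 5 slow ones.
    samples = [50.0] * 995 + [250.0] * 5
    print(f"SLI: {latency_sli(samples):.3f}, SLO met: {meets_slo(samples)}")
```

With the hypothetical sample, 99.5 percent of requests fall within the threshold, so the objective is met. A batch with 2 slow requests out of 100 would not meet it.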

An important prerequisite for this shift is to establish data foundations to measure progress accurately. It’s also important to establish partnership and alignment between technology and business leaders on desired metrics, thresholds, and goals.

Planning an SRE transformation

SRE journeys follow standard phases, but how well they are implemented can determine success or failure (Exhibit 4). Implementing SRE and cloud practices requires a disciplined commitment to change the ways of working, skills, and mindsets of teams. Cloud leaders need to communicate the value of the change to the organization, rewire how SRE teams work (including implementing and automating process changes), and provide tailored capability building programs supported by engineering leaders.

Image description: A timeline describes the four phases of SRE rollouts. This includes a discovery and design sprint (about 6 to 8 weeks), lighthouse frontrunners (about 10 to 12 weeks), initial scaling (about 6 to 9 months) and sustained scaling (12 to 18 months). During the discovery and design sprint, companies align on a North Star vision for the SRE model, set success measures, select lighthouse candidate(s), and prioritize adoption road map(s). During the lighthouse frontrunners phase, companies conduct adoption sprints for SRE lighthouse models and accelerate development of priority enterprise tools and processes. During initial scaling, companies roll out SRE across different cohorts using sprints and train-the-trainer boot camps and scale up a full set of enterprise capabilities and processes. During the sustained scaling phase, companies drive continuous improvement of SRE processes and culture as teams mature and measure performance of teams and products across the enterprise. Source: McKinsey analysis End of image description.

To be successful, the rollout of the SRE transformation will require changes across the operating model for application teams, operations, and infrastructure. For example, a company can evolve its tower-based infrastructure teams into “cloud product” teams that build complete end-to-end cloud services.

At the same time, traditional tooling teams can transition to full-platform engineering teams that support developer teams across the full software development life cycle by creating common tools, such as a developer portal that serves as a one-stop shop for developer services. They can create interfaces (APIs) that allow cloud products to be easily used by different business units. Platform engineering teams can also provide a self-service system to allow SRE teams to manage the full operational life cycle (including deploying all infrastructure and application code) with tools to quickly detect a system’s health and fix issues.

By transforming the operating model while migrating to the cloud, organizations can increase operational efficiencies by 20 to 25 percent, reduce cycle times by as much as 60 to 70 percent, and improve the resilience and security of their applications and platforms by more than 30 percent, according to McKinsey analysis.


Adopting an SRE model is critical to driving operational excellence and achieving benefits along the cloud journey. Successful SRE transformations ensure the adoption of automation and SRE practices, allow for enterprise-wide measurement of value delivery, and drive behavior and culture change.