SRE: Boost cloud resiliency and value

Many organizations use public cloud technology to reduce costs and improve business agility, innovation, and resilience. Generative AI (gen AI) is adding further to the estimated $3 trillion in EBITDA value that cloud could generate by 2030.¹

However, many organizations are still working to fully realize the benefits of cloud transformations. In many cases, simply transferring existing models (such as waterfall and ticket-based plan-build-run infrastructure) to the cloud can result in limited value creation or even value destruction (for example, by relying on highly manual processes). As a result, most cloud leaders have learned that modernizing assets on the cloud can enable greater value and benefits than simply “lifting and shifting” assets as-is to the cloud.

Key types of development approaches

The following practices can be used in a modern IT infrastructure operating model:

A product operating model for infrastructure applies product management principles to infrastructure services, treating them as internal products that serve developers and business teams. Instead of a traditional project-based, ticket-driven approach, infrastructure teams operate as cross-functional squads that own the design, development, and life cycle of infrastructure services such as compute, storage, networking, data, and developer platforms.
Agile software development is an iterative approach that emphasizes flexibility, collaboration, and continuous delivery of small, incremental improvements. Agile methodologies (such as Scrum or Kanban) prioritize customer feedback, adaptability to change, and working software over rigid plans. Agile teams operate in sprints (time-boxed iterations) and focus on delivering value early and often.
Platform engineering is the discipline of designing and building internal platforms that provide standardized, self-service infrastructure, tools, and workflows to enable developer productivity and operational efficiency. It involves automating infrastructure provisioning, continuous integration and continuous delivery (CI/CD) pipelines, security, and observability to reduce friction in software delivery. Platform teams typically provide golden paths (predefined best practices) to developers while ensuring reliability, scalability, and compliance.
Site reliability engineering (SRE) is a software engineering discipline focused on improving the reliability, availability, and performance of large-scale systems through automation, observability, and operational excellence. SRE teams apply software engineering principles to IT operations, reducing toil through automation, defining service-level objectives, and using techniques such as blameless postmortems, chaos engineering, and error budgets to balance feature delivery with system stability.
DevSecOps (development, security, and operations) is a security-first approach to DevOps that integrates security practices into the entire software development life cycle. Instead of treating security as a separate phase, DevSecOps embeds automated security testing, compliance checks, and threat modeling into CI/CD pipelines. Key practices include “shift-left security” (integrating security practices earlier in the development process), infrastructure-as-code security scanning, vulnerability management, and runtime security monitoring, ensuring that applications are secure by design.

To adapt to the cloud, most leading organizations adopt a best-in-class product and platform model to establish a modernized operating model for IT infrastructure.² These models typically involve two key parts. First, they include platform engineering for infrastructure, with services managed as products and delivered through self-service APIs and as software through code pipelines. Second, they use site reliability engineering (SRE),³ which uses software engineering practices and automation to manage application and infrastructure operations more effectively. (See sidebar, “Key types of development approaches.”)

In this article, we explore how to scale SRE in application operations as part of the product and platform operating model. If done successfully, this can help companies achieve greater benefits from their cloud applications and improve their delivery speed, reliability, and efficiency. In our experience, leading enterprises can achieve 60 to 70 percent of their desired financial goals (depending on their current level of maturity and adoption) by integrating SRE best practices into their technology migration to transform the operating model on cloud.

Making the most of cloud migrations with SRE

As infrastructure moves to the cloud, organizations have an opportunity to rethink processes and team structures. SRE practices emphasize operational excellence, system reliability, and automation across all layers of IT application, operations, and infrastructure. These practices can lead to significant improvements in operational productivity, tooling, platform synergies, speed, resilience, quality, security, and user experience (Exhibit 1).

Many organizations that adopt SRE, however, do not realize its full potential, because they adopt only part of the model. Common failure modes include the following:

expecting traditional operational support staff to become SRE experts without giving them the right skills or the necessary automation
embedding SRE experts into application teams without defining a clear operating model and responsibilities, which leads to “finger-pointing,” application teams simply handing issues off to SRE experts, and expensive operational teams
keeping SRE teams entirely separate from application teams and not providing them with the authority to push back on software that does not meet the organization’s standards for quality, operations, and resilience
focusing SREs primarily on reactive manual activities and not prioritizing automation and engineering to reduce demand and operational toil

Key steps for successful SRE implementation

Leading organizations successfully adopt the SRE model when they take a holistic approach: creating integrated SRE teams, modernizing IT service management (ITSM) processes, using platform engineering to increase automation, supporting continuous change with specialized talent, and evaluating outcomes with broad (rather than ticket-based) metrics (Exhibit 2). Below, we outline the five steps of a successful SRE implementation, from choosing an operating model to managing toward outcomes.

1. Choosing an SRE operating model

Implementing SRE starts with designing an integrated operating model that brings together application, operations, and infrastructure functions. This involves close collaboration with engineering and architecture leaders to align the SRE model with broader strategies and business objectives.

There are three levels of SRE deployments (Exhibit 3):

Concierge: SRE teams provide operational support and automation as a service, without making independent changes.
Embedded: SRE teams gain more autonomy and flexibility, making code changes directly to the cloud platform.
Partners: Mature SRE teams build proprietary code and new cloud platform automations, sharing them for reuse across the organization.

2. Modernizing ITSM operational processes for cloud

Organizations that simply move legacy processes to the cloud can miss out on potential benefits and retain existing operational friction. SRE involves reimagining ITSM processes to improve the delivery and outcomes of IT services. This is especially important during hybrid cloud transitions, in which divided processes can create more work for cloud-based teams.

Leaders focus on reimagining processes across four categories: design and provisioning, operational management, release and transmission, and service management. Transforming these processes involves incorporating more automation and streamlined workflows. With incident management, for example, rather than relying on manual detection and resolution of platform issues, automation systems are built with streamlined processes to proactively detect, triage, and address issues for application teams.
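As a rough illustration of the detect, triage, and address flow in the incident management example above, the following Python sketch maps a service health reading to an automated action. It is a minimal sketch under stated assumptions, not a description of any particular tool: the `HealthSignal` fields, the thresholds, the issue names, and the runbook names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HealthSignal:
    """One health reading for a service (hypothetical fields)."""
    service: str
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th-percentile latency in milliseconds


def detect(signal: HealthSignal,
           max_error_rate: float = 0.05,
           max_latency_ms: float = 500.0) -> Optional[str]:
    """Detect: return an issue type, or None if the service looks healthy.

    The default thresholds are illustrative assumptions.
    """
    if signal.error_rate > max_error_rate:
        return "high_error_rate"
    if signal.p99_latency_ms > max_latency_ms:
        return "high_latency"
    return None


def triage(issue: str) -> str:
    """Triage: map an issue type to a runbook (hypothetical names).

    Issues without an automated runbook are escalated to a person.
    """
    runbooks = {
        "high_error_rate": "rollback_last_release",
        "high_latency": "scale_out",
    }
    return runbooks.get(issue, "page_on_call")


def handle(signal: HealthSignal) -> str:
    """Address: choose the action to take for one health reading."""
    issue = detect(signal)
    if issue is None:
        return "no_action"
    return triage(issue)
```

In this sketch, `handle(HealthSignal("payments", 0.10, 120.0))` selects the rollback runbook because the error rate exceeds the assumed 5 percent threshold. A real system would execute the runbook and record the outcome rather than only return its name.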

Consider a top financial-services company that initially “lifted and shifted” its old IT processes directly into its new cloud systems. The company realized it could improve these systems with significantly streamlined processes that incorporate end-to-end automation. By adopting a DevSecOps approach, the company could redesign its IT processes on the cloud around an automated CI/CD pipeline. Tickets would be generated and updated automatically throughout the pipeline, eliminating the need to create them manually. The new system would also provide full tracking of changes and the ability to undo them, giving teams better control and making audits easier.
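The pipeline behavior described above (tickets created and updated automatically, changes fully tracked, and changes reversible) can be sketched in a few lines of Python. This is an illustrative sketch only: `ChangeTicket`, `Stage`, and `run_pipeline` are hypothetical names, and the status strings are assumptions rather than any ITSM tool's actual fields.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChangeTicket:
    """A change record that the pipeline maintains without manual input."""
    change_id: str
    status: str = "open"
    history: List[str] = field(default_factory=list)  # audit trail

    def update(self, status: str) -> None:
        self.status = status
        self.history.append(status)


@dataclass
class Stage:
    """One pipeline stage (hypothetical shape)."""
    name: str
    run: Callable[[], bool]    # returns True on success
    undo: Callable[[], None]   # reverses the stage's effect


def run_pipeline(change_id: str, stages: List[Stage]) -> ChangeTicket:
    """Run stages in order, updating the ticket at every step.

    If a stage fails, the stages that already completed are undone in
    reverse order and the ticket is closed as rolled back.
    """
    ticket = ChangeTicket(change_id)
    completed: List[Stage] = []
    for stage in stages:
        ticket.update(f"{stage.name}:started")
        if stage.run():
            ticket.update(f"{stage.name}:passed")
            completed.append(stage)
            continue
        ticket.update(f"{stage.name}:failed")
        for done in reversed(completed):
            done.undo()
            ticket.update(f"{done.name}:rolled_back")
        ticket.update("closed:rolled_back")
        return ticket
    ticket.update("closed:deployed")
    return ticket
```

Because every state change is appended to `history`, the ticket doubles as the audit trail for the change, which is the property the example relies on to make audits easier.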

3. Maturing toward platform engineering to eliminate toil

Cloud teams often deliver composite cloud services to improve automation, but significant work remains for developers to configure, integrate, and consume APIs for their applications (20 to 30 percent or more of a full cloud foundation). Full SRE transformations involve platform engineering to create complete end-to-end self-service with fully automated systems, which reduce operational toil and improve the developer experience. This includes developer platforms, observability tools, and CI/CD pipelines with built-in testing, security, and compliance functions.⁴

4. Shifting to software and cloud engineering talent for operations

Transitioning traditional infrastructure and operations groups to the SRE model can be challenging. Employees need to learn how to use cloud computing services, infrastructure as code, and continuous-delivery pipelines. Leading organizations typically need a range of talent strategies to build new talent and skills within the organization.⁵ They often start with learn-by-doing SRE apprenticeship programs, using boot camps to train teams on the new operating model and tools and coaching to develop more complex skills. A proven way to scale capability building is a train-the-trainer approach, in which SRE champions are chosen from each team to both learn SRE skills and coach the rest of their teams. After an SRE boot camp, leading organizations encourage continuous learning with metrics that track team development.

5. Managing toward outcomes rather than ticket-based SLAs

Traditional service-level agreements (SLAs) manage risk and offer predictability but can still allow fluctuations in service availability. In an SRE environment, SLAs are supplemented with leading and lagging metrics. For example, an SLA might include a service-level objective (SLO) that 99 percent of API requests are served within 100 milliseconds, measured by a service-level indicator (SLI) of actual API response times, along with leading indicators such as error rates and latency. By integrating these metrics, SRE-style SLAs create transparency and balance speed, quality, and resilience. Better measurement systems enable new possibilities, such as embedding SRE-related outcomes into teams’ measurable goals, offering real-time reporting of metrics to identify areas that need intervention, and reflecting progress in teams’ quarterly business reviews.

An important prerequisite for this shift is to establish data foundations to measure progress accurately. It’s also important to establish partnership and alignment between technology and business leaders on desired metrics, thresholds, and goals.
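To make the SLO example above concrete, the sketch below computes a latency SLI from observed response times, checks it against a target of 99 percent within 100 milliseconds, and reports how much of the error budget remains. It is a minimal sketch under assumptions: the class and function names are hypothetical, and the error-budget formula (the share of allowed slow requests not yet used) is one common convention rather than a definition taken from this article.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencySLO:
    """SLO of the form 'target fraction of requests within threshold_ms'."""
    target: float        # for example, 0.99 for 99 percent
    threshold_ms: float  # for example, 100.0 milliseconds


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """SLI: fraction of observed requests served within the threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed in the window")
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, slo: LatencySLO) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo.target


def error_budget_remaining(sli: float, slo: LatencySLO) -> float:
    """Fraction of the error budget still unused in this window.

    1.0 means no slow requests at all, 0.0 means the budget is exactly
    spent, and a negative value means the SLO has been breached.
    """
    if not 0.0 < slo.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    allowed_bad = 1.0 - slo.target
    actual_bad = 1.0 - sli
    return 1.0 - actual_bad / allowed_bad


# Illustrative window: 1,000 requests, 5 of them slower than 100 ms.
slo = LatencySLO(target=0.99, threshold_ms=100.0)
window = [40.0] * 995 + [250.0] * 5
sli = latency_sli(window, slo.threshold_ms)
print(f"SLI={sli:.3f} meets_slo={meets_slo(sli, slo)} "
      f"budget_left={error_budget_remaining(sli, slo):.2f}")
```

In this illustrative window, 5 slow requests against an allowance of 10 leaves roughly half of the error budget, the kind of signal a team can use to balance feature delivery against stability.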

Planning an SRE transformation

SRE journeys follow standard phases, but how well each phase is implemented can determine success or failure (Exhibit 4). Implementing SRE and cloud practices requires a disciplined commitment to changing teams’ ways of working, skills, and mindsets. Cloud leaders need to communicate the value of the change to the organization, rewire how SRE teams work (including implementing and automating process changes), and provide tailored capability-building programs supported by engineering leaders.

To succeed, the rollout of the SRE transformation will require changes across the operating model for application teams, operations, and infrastructure. For example, a company can evolve its tower-based infrastructure teams into “cloud product” teams that build complete end-to-end cloud services.

At the same time, traditional tooling teams can transition to full-platform engineering teams that support developer teams across the full software development life cycle by creating common tools, such as a developer portal that serves as a one-stop shop for developer services. They can create interfaces (APIs) that allow cloud products to be easily used by different business units. Platform engineering teams can also provide a self-service system to allow SRE teams to manage the full operational life cycle (including deploying all infrastructure and application code) with tools to quickly detect a system’s health and fix issues.

By transforming the operating model while migrating to the cloud, organizations can increase operational efficiencies by 20 to 25 percent, reduce cycle times by up to 60 to 70 percent, and improve the resilience and security of their applications and platforms by more than 30 percent, according to McKinsey analysis.

Adopting an SRE model is critical to driving operational excellence and achieving benefits along the cloud journey. Successful SRE transformations ensure the adoption of automation and SRE practices, allow for enterprise-wide measurement of value delivery, and drive behavior and culture change.
