Site Reliability Engineering: Measuring and Managing Reliability

Course Code: AT-SRE

Duration: 2 Days

Price: Contact For Pricing

Course Description

Service level indicators (SLIs) and service level objectives (SLOs) are fundamental tools for measuring and managing reliability. In this course, you'll learn how to create appropriate SLIs and SLOs and how to use an error budget to manage reliability.

Objectives

  • Learn the best practices of Google SRE.
  • Know the definitions of SLOs, SLIs, and SLAs and how they affect reliability.
  • Understand how to set SLOs and SLIs.
  • Understand error budgets.
  • Analyze risks associated with SLOs and the consequences of missing SLOs.

Audience

This class is primarily intended for the following participants:

  • DevOps specialists.
  • Software developers.
  • Product managers and application owners.
  • IT business decision makers.

Prerequisites

To get the most out of this course, participants should have:

  • Familiarity with the development cycle of cloud applications.
  • Familiarity with managing the response to outages.

Content

The course includes presentations, demonstrations, and hands-on labs.

 

Module 1: Introduction to Site Reliability Engineering

  • Introduction to Site Reliability Engineering.
  • Understand the course objectives and overall structure.
  • Understand the principles that underlie Site Reliability Engineering.

 

Module 2: Targeting Reliability

  • Definition of SLAs and SLOs.
  • Defining 'good enough' reliability.
  • What to consider when setting SLOs for your application in your organization.

 

Module 3: Operating for Reliability

  • Trading reliability vs. features.
  • Understanding error budgets.
  • Understand the trade-offs in having multiple SLIs for a given application.
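
As a rough illustration of the error budget idea listed above, the sketch below derives a budget from an SLO target. It is a minimal example under stated assumptions, not course material: the function names, the 30-day window, and the request counts are all hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window.

    slo_target is a fraction, e.g. 0.999 for "99.9%" (assumed convention).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = (1.0 - slo_target) * total_events  # bad events the SLO permits
    actual_bad = total_events - good_events          # bad events observed
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO and 500 failed requests out of 1,000,000.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```

With these made-up figures, half of the budget is spent. A team could use the remaining fraction to decide whether to keep shipping features or shift effort toward reliability work.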

 

Module 4: Choosing a Good SLI

  • Defining types of SLIs.
  • How to formulate SLI specifications.
  • Setting targets for those SLIs.

 

Module 5: Developing SLOs and SLIs

  • Setting SLOs, SLIs, and error budgets in a sample application.
  • How to go from a user journey to an SLI implementation and an SLO target using a four-step process.

 

Module 6: Quantifying Risks to SLOs

  • Characterizing and analyzing risks to SLOs.
  • Resources for learning more about data analysis, machine learning, business process analysis, and optimization.
  • Model risks in terms of time between failures, time to recovery, and impact percentage.
  • Estimate the error budget cost of each risk using our Risk Analysis spreadsheet.
  • Meet a desired SLO target by trading off engineering work to mitigate risks.
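
The risk model named above (time between failures, time to recovery, impact percentage) can be sketched in a few lines. This is a simplified illustration, not the course's Risk Analysis spreadsheet, whose exact formulas are not shown here. The `Risk` class, its field names, and the example risks are all hypothetical.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class Risk:
    name: str
    tbf_days: float         # assumed average time between failures, in days
    ttr_minutes: float      # assumed average time to recovery, in minutes
    impact_fraction: float  # share of users or requests affected, 0.0 to 1.0

    def bad_minutes_per_year(self) -> float:
        """Expected error budget cost of this risk per year, in bad minutes."""
        incidents_per_year = 365 / self.tbf_days
        return incidents_per_year * self.ttr_minutes * self.impact_fraction


def annual_budget_minutes(slo_target: float) -> float:
    """Total bad minutes per year that the SLO target allows."""
    return (1.0 - slo_target) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Made-up risks for illustration only.
    risks = [
        Risk("bad config push", tbf_days=90, ttr_minutes=120, impact_fraction=0.5),
        Risk("zone outage", tbf_days=365, ttr_minutes=240, impact_fraction=1.0),
    ]
    budget = annual_budget_minutes(0.999)
    # List the most expensive risks first.
    for risk in sorted(risks, key=Risk.bad_minutes_per_year, reverse=True):
        cost = risk.bad_minutes_per_year()
        print(f"{risk.name}: {cost:.1f} min/year ({cost / budget:.0%} of budget)")
    total = sum(r.bad_minutes_per_year() for r in risks)
    print("fits SLO" if total <= budget else "exceeds SLO")
```

Sorting risks by cost shows which mitigation would free the most budget. If the total exceeds the budget, the options are to reduce a risk's frequency, recovery time, or impact, or to relax the SLO target.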

 

Module 7: Consequences of SLO Misses

  • Documenting SLOs and developing error budget policies.
  • Understand how to record and present SLI vs SLO data in a useful structure and format.
  • Be able to list the essential components of an error budget policy.
  • Enumerate possible options for reaction to an error budget overspend.