
The site reliability workbook : practical ways to implement SRE

Expands on Google's approach to SRE with worked examples for each essential facet of the discipline, prepared in cooperation with Google Cloud customers and based on their experiences. Covers methods for running services at scale and starting SRE in greenfield or...

Bibliographic Details
Classification: Electronic Book
Other Authors: Beyer, Betsy (Editor), Murphy, Niall Richard, Rensin, David K., Kawahara, Kent, Thorne, Stephen
Format: Electronic eBook
Language: English
Published: Sebastopol, CA : O'Reilly Media, 2018.
Subjects:
Online Access: Full text (requires prior registration with an institutional email address)
Table of Contents:
  • How SRE relates to DevOps
  • Foundations. Implementing SLOs
  • SLO engineering case studies
  • Alerting on SLOs
  • Eliminating toil
  • Simplicity
  • Practices. On-call
  • Incident response
  • Postmortem culture: learning from failure
  • Managing load
  • Introducing non-abstract large system design
  • Data processing pipelines
  • Configuration design and best practices
  • Configuration specifics
  • Canarying releases
  • Processes. Identifying and recovering from overload
  • SRE engagement model
  • SRE: reaching beyond your walls
  • SRE team lifecycles
  • Organizational change management in SRE
  • A. Example SLO document
  • B. Example error budget policy
  • C. Results of postmortem analysis.
  • Intro; Copyright; Table of Contents; Foreword I; Foreword II; Preface; Conventions Used in This Book; Using Code Examples; O'Reilly Safari; How to Contact Us; Acknowledgments; Chapter 1. How SRE Relates to DevOps; Background on DevOps; No More Silos; Accidents Are Normal; Change Should Be Gradual; Tooling and Culture Are Interrelated; Measurement Is Crucial; Background on SRE; Operations Is a Software Problem; Manage by Service Level Objectives (SLOs); Work to Minimize Toil; Automate This Year's Job Away; Move Fast by Reducing the Cost of Failure; Share Ownership with Developers
  • Use the Same Tooling, Regardless of Function or Job Title; Compare and Contrast; Organizational Context and Fostering Successful Adoption; Narrow, Rigid Incentives Narrow Your Success; It's Better to Fix It Yourself; Don't Blame Someone Else; Consider Reliability Work as a Specialized Role; When Can Substitute for Whether; Strive for Parity of Esteem: Career and Financial; Conclusion; Part I. Foundations; Chapter 2. Implementing SLOs; Why SREs Need SLOs; Getting Started; Reliability Targets and Error Budgets; What to Measure: Using SLIs; A Worked Example
  • Moving from SLI Specification to SLI Implementation; Measuring the SLIs; Using the SLIs to Calculate Starter SLOs; Choosing an Appropriate Time Window; Getting Stakeholder Agreement; Establishing an Error Budget Policy; Documenting the SLO and Error Budget Policy; Dashboards and Reports; Continuous Improvement of SLO Targets; Improving the Quality of Your SLO; Decision Making Using SLOs and Error Budgets; Advanced Topics; Modeling User Journeys; Grading Interaction Importance; Modeling Dependencies; Experimenting with Relaxing Your SLOs; Conclusion; Chapter 3. SLO Engineering Case Studies
  • Evernote's SLO Story; Why Did Evernote Adopt the SRE Model?; Introduction of SLOs: A Journey in Progress; Breaking Down the SLO Wall Between Customer and Cloud Provider; Current State; The Home Depot's SLO Story; The SLO Culture Project; Our First Set of SLOs; Evangelizing SLOs; Automating VALET Data Collection; The Proliferation of SLOs; Applying VALET to Batch Applications; Using VALET in Testing; Future Aspirations; Summary; Conclusion; Chapter 4. Monitoring; Desirable Features of a Monitoring Strategy; Speed; Calculations; Interfaces; Alerts; Sources of Monitoring Data; Examples
  • Managing Your Monitoring System; Treat Your Configuration as Code; Encourage Consistency; Prefer Loose Coupling; Metrics with Purpose; Intended Changes; Dependencies; Saturation; Status of Served Traffic; Implementing Purposeful Metrics; Testing Alerting Logic; Conclusion; Chapter 5. Alerting on SLOs; Alerting Considerations; Ways to Alert on Significant Events; 1: Target Error Rate ≥ SLO Threshold; 2: Increased Alert Window; 3: Incrementing Alert Duration; 4: Alert on Burn Rate; 5: Multiple Burn Rate Alerts; 6: Multiwindow, Multi-Burn-Rate Alerts; Low-Traffic Services and Error Budget Alerting