Incident Metrics in SRE

Bibliographic Details
Main Author: Davidovič, Štěpán (Author)
Corporate Author: Safari, an O'Reilly Media Company
Format: Electronic eBook
Language: English
Published: O'Reilly Media, Inc., 2021.
Edition: 1st edition.
Online Access: Full text (requires prior registration with an institutional email address)
Description
Summary: Site reliability engineers often use MTTx metrics to evaluate improvements or track trends. But is either MTTR (mean time to recovery) or MTTM (mean time to mitigation) ideal for decision making or trend analysis when it comes to production incidents? This report not only demonstrates how and why MTTx metrics come up short but also proposes ways to think about metrics differently to get the answers you want. Google SRE Štěpán Davidovič uses a Monte Carlo simulation to show you how poorly MTTx metrics perform with production incidents. Applying these metrics is trickier than it seems and can be dangerously misleading in many practical scenarios. With this report, you'll explore alternative methods for achieving these measurements. Work with a simple model of the incident lifecycle and timings using empirical datasets. Use an analytical approach to get a clear picture of what your incident durations look like. Focus on narrow questions of the incident lifecycle rather than analyzing incident statistics using MTTx. Explore alternative methods for achieving your measurements.
Physical Description: 1 online resource (34 pages)