
Top Incident Management KPIs


  • MTTA (Mean time to acknowledge): Monitoring MTTA over time shows you how quickly your team starts responding to an incident. How long does it take for an engineer to receive a notification and begin working on the issue? Are there any problems with routing alerts to the person who needs to acknowledge an issue?


  • MTTR (Mean time to resolution): By measuring MTTR, you can determine how well you’re resolving incidents. MTTA shows how quickly you’re acknowledging an incident, and the difference between MTTR and MTTA shows how long it takes to actually fix the problem once someone has picked it up.


  • Average Incident Response Time: How much time is your team spending navigating an incident and routing it to the right person? On average, incident response accounts for 73% of an incident’s lifecycle. So, you can see how shortening the average time spent in the incident response phase will result in a vastly more efficient incident management process.


  • Total Number of Incidents: How often are you receiving alerts that turn into incidents? If you have a high number of incidents, why is that the case? Are the alerts unactionable and simply causing alert fatigue, or do you need to work on a solution for the actual problem, not just a patch or a quick fix? Monitoring how many incidents are coming through the pipeline is a great measurement for determining your system’s health.


  • Percentage of Incidents Resolved in a Defined Timeframe: Define a timeframe that would equate to successful incident remediation for your team. Then, monitor the percentage of incidents resolved in that timeframe. Setting this timeframe provides a benchmark to reach for and measure against as you work to shorten the incident lifecycle.


  • Amount of Downtime, Percentage of Unavailability: Now this is a big one. You need to understand how often your system experiences downtime, the costs associated with downtime, and how often this affects customers. The only way to address a problem is by acknowledging that you have one. If you don’t track your percentage of unavailability, you have no way to know how reliable your system is.


  • Time Spent On-Call: Looking at how much time individuals spend on-call, and the times of day they’re put on-call, can show you who’s bearing the brunt of on-call responsibilities. Is one person handling numerous unactionable alerts at 3 AM while another user rarely needs to respond to an incident? Try to use this data to divvy up on-call responsibilities and make on-call easier for everybody.


  • Average Time Between Incidents: This is an excellent metric for tracking the reliability of your system over time. It can show whether you’re in a reactive vs. proactive incident management state. The larger the average time between incidents, the more time your team has to spend building reliability into new functionality, rather than simply responding to alerts.


  • Escalation Rate: The escalation rate shows how often alerts fail to reach the correct person on the first try. If incidents are escalated frequently, then you likely need to tweak your alert routing rules or re-think who’s on-call for certain issues. Escalation capability is essential for collaborative incident management teams, but you’d like alerts to reach the correct person on the first try as often as possible.


  • Post-Incident Reviews: Post-incident reviews aren’t a concrete KPI. But it’s important to record all your top incident management KPIs and roll them into comprehensive post-incident reviews. PIRs should be conducted to find weaknesses in your people operations, processes, or tooling. Then, you can quantitatively measure these important KPIs against benchmarks to see that your incident management techniques continue to improve.
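
To make several of the KPIs above concrete, here is a minimal Python sketch that computes them from a list of incident records. The `Incident` record shape, its field names, the one-hour default target, and the sample data are all assumptions made for illustration, not the format of any particular tool. The sketch also assumes that time between incidents is measured from one incident's start to the next, and that each incident counts as downtime from the moment it opens until it is resolved, with no overlapping incidents.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; field names are assumptions for illustration.
    opened: datetime
    acknowledged: datetime
    resolved: datetime
    escalated: bool = False


def _minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def incident_kpis(incidents, target=timedelta(hours=1)):
    """Compute KPIs (durations in minutes) for a non-empty list of incidents."""
    ordered = sorted(incidents, key=lambda i: i.opened)
    count = len(ordered)
    mtta = mean(_minutes(i.acknowledged - i.opened) for i in ordered)
    mttr = mean(_minutes(i.resolved - i.opened) for i in ordered)
    resolved_in_target = sum(1 for i in ordered if i.resolved - i.opened <= target)
    # Assumption: "time between incidents" is measured start to start.
    gaps = [_minutes(b.opened - a.opened) for a, b in zip(ordered, ordered[1:])]
    return {
        "total_incidents": count,
        "mtta_min": mtta,
        "mttr_min": mttr,
        "fix_after_ack_min": mttr - mtta,
        "pct_resolved_in_target": 100 * resolved_in_target / count,
        "avg_min_between_incidents": mean(gaps) if gaps else None,
        "escalation_rate_pct": 100 * sum(i.escalated for i in ordered) / count,
    }


def unavailability_pct(incidents, period: timedelta) -> float:
    """Share of the period spent in an incident, assuming no overlapping incidents."""
    downtime = sum((i.resolved - i.opened for i in incidents), timedelta())
    return 100 * downtime / period


# Made-up sample data: two incidents on the same day.
day = datetime(2024, 1, 1)
sample = [
    Incident(day, day + timedelta(minutes=5), day + timedelta(minutes=45)),
    Incident(
        day + timedelta(hours=6),
        day + timedelta(hours=6, minutes=15),
        day + timedelta(hours=8),
        escalated=True,
    ),
]
kpis = incident_kpis(sample)
downtime_pct = unavailability_pct(sample, timedelta(days=1))
```

Tracking these values per week or per month, rather than as one all-time number, is what lets you compare them against the benchmarks you set in post-incident reviews.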
