What is incident management?
Incident management is the process of responding to an unplanned event or service interruption to restore the service to its operational state. According to ITIL (IT Infrastructure Library), “the incident management process ensures that normal service operation is restored as quickly as possible and the business impact is minimized.”
Incidents are events of any kind that disrupt or reduce the quality of service (or threaten to do so). A business application going down is an incident. A crawling-but-not-yet-dead web server can be an incident, too. It’s running slowly and interfering with productivity. Worse yet, it poses an even greater risk of complete failure.
For the sake of getting everyone on the same page, here are some quick definitions of related terms:
ITSM (IT service management) is a common approach to creating, supporting, and managing IT services. The core concept of ITSM is the belief that IT should be delivered as a service. And one of the core practices of ITSM is incident management.
ITIL is a set of best practices for ITSM (think of it as a playbook).
A problem is the not-yet-known root cause behind one or more incidents. In the examples above, where a web server is crawling and a business application is down, a misconfigured router could be the underlying problem behind both.
The importance of incident management as an ITSM practice
Considering all the software services organizations rely on today, there are more potential failure points than ever. And the impact of an incident can be huge. Some industry estimates put the cost of a major incident at $300,000 for every hour a system is down. For some web-based services, that number can be dramatically higher.
Having a well-defined incident management process can help reduce those costs dramatically. The benefits of a well-defined process include:
Faster incident resolution
Reduced costs or revenue losses for the organization resulting from incidents
Better communication—both internal and external—during incidents
Continuous learning and improvement
The incident management process
The key to incident management is having a good process and sticking to it. Even that can seem daunting, but the good news is that you can learn from thousands of other IT service teams' experiences.
One of the top mistakes of busy, growing IT organizations is to try to reinvent the wheel and create processes from scratch. Draw on best practices and don't waste time building a homegrown tool for fielding tickets.
Here is a high-level overview of the important steps for an incident management practice:
Identify an incident and log it
An incident can come from anywhere. An employee can call you to report it, or it can literally fall through the ceiling tile and land in your lap, in the case of an ill-placed network hub and a leaky roof. (Not that we’re speaking from experience...)
No matter the source, the first two steps are simple: someone identifies an incident, then someone logs it.
If you receive the incident already logged via your service desk, these first two steps are already done for you. If you get a phone call, or the incident is reported via email or text or carrier pigeon, it’s the service desk team’s job to properly log it in your service desk.
These incident logs (i.e., tickets) typically include:
The name of the person reporting the incident
The date and time the incident is reported
A description of the incident (what is down or not working properly)
A unique identification number assigned to the incident, for tracking
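To make the list above concrete, here is a minimal sketch of a ticket record in Python. The class name, the `INC-` ID format, and the in-process counter are assumptions made for illustration; a real service desk tool assigns and stores these values for you.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical ID source: a simple in-process counter, for illustration only.
_next_id = count(1)


@dataclass
class IncidentTicket:
    reporter: str      # name of the person reporting the incident
    description: str   # what is down or not working properly
    # Date and time the incident is reported (UTC).
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Unique identification number, for tracking.
    incident_id: str = field(
        default_factory=lambda: f"INC-{next(_next_id):05d}"
    )


def log_incident(reporter: str, description: str) -> IncidentTicket:
    """Create a ticket, rejecting blank reporter or description fields."""
    if not reporter.strip() or not description.strip():
        raise ValueError("reporter and description are required")
    return IncidentTicket(reporter.strip(), description.strip())


ticket = log_incident("Dana Smith", "Web server is responding slowly")
print(ticket.incident_id, ticket.reporter, ticket.reported_at.isoformat())
```

Rejecting blank fields at logging time is one way to make sure every ticket carries enough information to act on, however the incident was reported.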
Categorize your incident
Assign a logical, intuitive category (and subcategory, as needed) to every incident. If you don’t, you’re cutting off your ability to later analyze your data and look for trends and patterns, which is a critical part of effective problem management and preventing future incidents. And make sure to choose an ITSM service desk solution that allows you to easily customize incident categories.
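As a rough illustration of why consistent categories pay off, the sketch below validates each incident against a small category tree and then counts which category appears most often. The category names and the incident history are invented for the example; your own service desk would define its own.

```python
from collections import Counter
from typing import Tuple

# Hypothetical category tree; a real service desk lets you customize this.
CATEGORIES = {
    "network": {"router", "wifi", "vpn"},
    "application": {"outage", "performance"},
    "hardware": {"laptop", "printer"},
}


def categorize(category: str, subcategory: str) -> Tuple[str, str]:
    """Validate a (category, subcategory) pair against the known tree."""
    if subcategory not in CATEGORIES.get(category, set()):
        raise ValueError(f"unknown category: {category}/{subcategory}")
    return (category, subcategory)


# Made-up incident history, purely for illustration.
history = [
    categorize("network", "router"),
    categorize("application", "performance"),
    categorize("network", "router"),
    categorize("hardware", "printer"),
    categorize("network", "vpn"),
]

# Count incidents per category pair; the most frequent pair is a
# natural candidate for a closer look in problem management.
trend = Counter(history)
print(trend.most_common(1))
```

Free-text categories would make this kind of count unreliable, which is the practical reason to validate against a fixed list.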
Prioritize your incident
Every incident must be prioritized. Start by assessing its impact on the business. Consider the number of people that will be impacted and the potential financial, security, and compliance implications. This will help you determine how much pain the incident is causing and how urgently the business must resolve it.
The best practice here is to define your severity and priority levels before an incident happens, making it simpler for incident managers to gauge priority quickly.
When you’re in doubt about the priority level, go with the higher one. It’s better to err on the side of caution than to let something severe fall through the cracks.
Once you’ve set those priorities, address all open incidents in order of priority. Most organizations set clear service agreements around each level of priority, so customers know how quickly to expect a response and resolution.
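One common way to define priority levels ahead of time is a small matrix that combines impact and urgency. The sketch below is an assumption for illustration, not a scheme prescribed by ITIL: the three levels, the 1-to-5 scale (1 is most urgent), and the function names are all hypothetical.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> int:
    """Map impact and urgency to a priority from 1 (most urgent) to 5.

    Hypothetical scheme: the higher the combined score, the lower the number.
    """
    return 7 - (impact + urgency)


def when_in_doubt(*candidates: int) -> int:
    """Choose the more severe candidate priority (the lower number)."""
    return min(candidates)


open_incidents = [
    ("INC-00003", priority(Level.LOW, Level.MEDIUM)),
    ("INC-00001", priority(Level.HIGH, Level.HIGH)),
    ("INC-00002", priority(Level.MEDIUM, Level.HIGH)),
]

# Work the queue in order of priority, most urgent first.
queue = sorted(open_incidents, key=lambda item: item[1])
print([incident_id for incident_id, _ in queue])
```

Because the matrix is fixed in advance, an incident manager only has to judge impact and urgency during an incident; the priority follows from the table.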
Incident response is a pretty broad term, so let’s break it down a bit further into the most likely steps you'll perform once you’ve identified, categorized, and prioritized an incident.
Think of initial diagnosis as the triage function that a hospital performs on new patients. The service desk employee is formulating a quick hypothesis around what is likely wrong, so they can either set about fixing it or follow the appropriate procedures and compile the right resources to get it resolved. Knowledge bases and diagnostic manuals are helpful tools at this step.
If the first agent to respond is able to resolve the incident based on their initial diagnosis and available knowledge and tools, the incident is resolved. Otherwise, it’s time to escalate.
Your front-line support team should be able to resolve a large number of the most frequent incidents without escalating. For the incidents they can’t resolve, the goal is to gather and log the right information to help the next level of support get up to speed quickly, so they can resolve the incident promptly.
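The resolve-or-escalate decision can be sketched as a simple lookup. Everything here is a simplification for illustration: the knowledge base is a hypothetical keyword table, and real diagnosis relies on an agent's judgment rather than string matching.

```python
from typing import Dict, List

# Hypothetical knowledge base: symptom keyword -> documented fix.
KNOWLEDGE_BASE = {
    "password": "Reset the password via the self-service portal.",
    "printer": "Restart the print spooler and re-add the printer.",
}


def triage(description: str, notes: List[str]) -> Dict[str, object]:
    """Resolve from the knowledge base if possible, otherwise escalate.

    On escalation, the description and the agent's notes travel with the
    incident so the next level of support does not start from scratch.
    """
    text = description.lower()
    for keyword, fix in KNOWLEDGE_BASE.items():
        if keyword in text:
            return {"action": "resolve", "fix": fix}
    return {
        "action": "escalate",
        "handoff": {"description": description, "notes": list(notes)},
    }


print(triage("I forgot my password", [])["action"])
print(triage("Database replication is lagging", ["Disk space checked"])["action"])
```

The point of the handoff record is the second half of the paragraph above: whatever the first agent learned should be logged before the incident moves on.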
Investigation and diagnosis
ITIL calls this out as its own single step. In reality, it happens throughout the incident lifecycle.
The first support person to respond is already investigating, to an extent, when they collect information, and may diagnose and even resolve the incident without any escalation. In that case, you move straight to the final steps: resolution and recovery, then incident closure.
Otherwise, investigation and diagnosis will happen at every step of the way as you escalate or bring in outside resources to consult and assist with the resolution.
Resolution and recovery
Eventually, and ideally within your established service level agreements (SLAs), you will arrive at a diagnosis and perform the necessary steps to resolve the incident. Recovery refers to the time it may take for operations to be fully restored, since some fixes (such as bug patches) may require testing and deployment even after the proper resolution has been identified.
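Checking a resolution against an SLA comes down to simple date arithmetic. In the sketch below, the resolution targets per priority level are invented for illustration; your own agreements will set different numbers.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical resolution targets per priority level (1 = most urgent).
RESOLUTION_TARGETS = {
    1: timedelta(hours=4),
    2: timedelta(hours=8),
    3: timedelta(days=1),
    4: timedelta(days=3),
}


def sla_deadline(reported_at: datetime, priority: int) -> datetime:
    """Latest time the incident can be resolved and still meet its target."""
    return reported_at + RESOLUTION_TARGETS[priority]


def within_sla(reported_at: datetime, resolved_at: datetime, priority: int) -> bool:
    """True if the incident was resolved on or before its deadline."""
    return resolved_at <= sla_deadline(reported_at, priority)


reported = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
resolved = datetime(2024, 1, 15, 12, 30, tzinfo=timezone.utc)

print(within_sla(reported, resolved, priority=1))
print(within_sla(reported, resolved + timedelta(hours=1), priority=1))
```

Note that this measures time to resolution only. If recovery takes longer, for example while a patch is tested and deployed, you may want to track that as a separate timestamp.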
The incident is then passed back to the service desk (if it was escalated) to be closed. To maintain quality and ensure a smooth process, only service desk employees should close incidents. Before closing, the incident owner should check with the person who reported the incident to confirm that the resolution is satisfactory.
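These closure rules are easy to enforce in software. The sketch below is one possible shape, with hypothetical role names and status values; most service desk tools express the same checks as workflow permissions.

```python
from dataclasses import dataclass


class ClosureError(Exception):
    """Raised when an incident cannot be closed yet."""


@dataclass
class Incident:
    incident_id: str
    status: str = "resolved"          # hypothetical states: open, resolved, closed
    reporter_confirmed: bool = False  # has the reporter accepted the fix?


def close_incident(incident: Incident, closed_by_role: str) -> Incident:
    """Close an incident only when every closure rule is satisfied."""
    if closed_by_role != "service_desk":
        raise ClosureError("only service desk staff may close incidents")
    if incident.status != "resolved":
        raise ClosureError("incident must be resolved before it is closed")
    if not incident.reporter_confirmed:
        raise ClosureError("reporter has not confirmed the resolution")
    incident.status = "closed"
    return incident


incident = Incident("INC-00001", reporter_confirmed=True)
print(close_incident(incident, closed_by_role="service_desk").status)
```

Raising an error instead of silently closing keeps an unconfirmed incident visible in the queue until someone follows up with the reporter.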
The incident management process may seem unnecessarily formal, particularly if you are part of a smaller organization. Regardless of your team structure, though, the incident lifecycle is the same and escalations often need to occur. Don’t skip steps!
Incidents happen. But a strong incident management process means that you can reduce their impact and restore services quickly.