Understanding Problem Management in ITSM
What is problem management?
Problem management is the process of identifying and managing the causes of incidents on an IT service. It is a core component of ITSM frameworks.
The closer you get to real incident experts, the less you actually hear the question: “What caused the incident?” Sure, you’ll hear it plenty from executives, and customers, and the press. But the experts know better.
Because the answer to “what caused the incident” is often dry and unhelpful: a rewritten config file, a corrupted database entry.
But what were the contributing causes behind the thing that caused the incident? What were the factors that led up to the incident? How is it possible that a config file could be rewritten? What conditions create a corrupted database entry? These are the questions you hear experts ask. And they’re at the heart of problem management.
Problem management isn’t just about finding and fixing incidents. It’s about identifying and understanding the underlying causes of an incident, and identifying the best method to eliminate those causes. Moreover, pinpointing a cause has no value to an organization if it happens in an isolated process run by a siloed team, so problem management should be continuous and widely practiced across multiple teams, including IT, security, and software development. An incident may be over once the service is up and running again, but until the underlying causes and contributing factors are addressed, the problem remains.
The relationship between problem management and other key ITIL processes
Problem management works alongside incident management and other ITIL practices to form an overall ITSM strategy.
Problem management vs. incident management
ITIL defines a problem as a cause, or potential cause, of one or more incidents. The behaviors behind effective incident management and effective problem management are often similar and overlapping, but there are still key differences. For example, rolling back a recent deployment may get the service operating again and end the incident, but the underlying problem remains.
That said, we believe that problem management and incident management practices are becoming increasingly intertwined. In the time between incidents, IT teams can focus their efforts on problem investigations that lead to improvements and better service quality. This is where problem management becomes most valuable to the organization.
Problem management and change management
Change management is the process of planning, tracking, and releasing changes without service disruption or downtime.
When a change does cause disruption or downtime, that change is analyzed during incident and problem management processes.
Problem management and knowledge management
Knowledge management creates a repository of solutions and documentation for common procedures and even incident workarounds. Used together with problem management, a healthy knowledge management practice can enable faster incident resolution and fewer incidents altogether.
Problem management and service request management
Service request management is the practice of processing a request from a user for something to be provided, such as access to applications, software enhancements, and information. It can sometimes be difficult to distinguish a service request from an incident. In fact, the two were not distinguished, and both were lumped into the category “incidents,” until the release of ITIL V3 in 2007. ITIL now defines an incident as “an unplanned interruption to an IT service or reduction in the quality of an IT service.” It defines a service request as “a formal request from a user for something to be provided – for example, a request for information or advice; to reset a password, or to install a workstation for a new user.”
What are the benefits of problem management?
Done right, problem management unleashes many benefits for the business.
Decrease time to resolution
Teams that unlock the problems behind today’s incidents will be better prepared to attack incidents in the future. By codifying best practices around problem analysis, teams will be able to more quickly respond and take action during the next service disruption.
Avoid costly incidents
Avoiding incidents will save time, money, and lots of pain. According to Gartner, many organizations report downtime costing more than $300,000 per hour. For some web-based services, that number can be dramatically higher.
Stop responding to incidents so frequently and return resources and time to teams who could be shipping new value to customers.
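As a rough illustration of the arithmetic, the sketch below estimates downtime cost from outage duration. It uses the $300,000-per-hour figure cited above as a default; the function name and the outage durations are hypothetical, and your own hourly cost will differ.

```python
def downtime_cost(outage_minutes: float, cost_per_hour: float = 300_000) -> float:
    """Estimate the cost of one outage from its duration and an hourly rate."""
    if outage_minutes < 0:
        raise ValueError("outage_minutes must be non-negative")
    return outage_minutes / 60 * cost_per_hour


# Hypothetical example: the same 45-minute incident recurring four times a year.
outages = [45, 45, 45, 45]
annual_cost = sum(downtime_cost(minutes) for minutes in outages)
print(f"Estimated annual downtime cost: ${annual_cost:,.0f}")
```

Under these assumptions, eliminating the problem behind a recurring incident saves the cost of every repeat, not just the next one.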
Empower your team to find and learn from underlying causes
When organizations effectively practice problem management, teams continually investigate, learn from incidents, and ship valuable updates. Unfortunately, many enterprises create a siloed problem management team that is too far removed from day-to-day operations to eliminate the most pressing problems.
Promote continuous service improvement
Problem management prevents incidents and also delivers value. For instance, fixing a problem that causes degraded performance also ships a valuable improvement in service quality.
Increase customer satisfaction
Better problem management leads to fewer incidents and happier customers. Conversely, customers’ patience wears thin when they notice the same incident happening multiple times. Decreasing the occurrence of repeat incidents builds customer trust.
The problem management process
At Atlassian, we advocate bringing the problem and incident management processes closer together.
When problem management is a heavy, siloed, and separate process, companies can end up creating a dumping ground of problems. In some teams, this backlog is where problem issues go to die. It’s best to get problems in front of the teams that can handle them and carry out valuable investigations.
That said, it’s good to understand the main steps that make up a problem management process:
Problem detection - Proactively find problems so they can be fixed, or identify workarounds before future incidents happen.
Categorization and prioritization - Track and assess known problems to keep teams organized and working on the most relevant and high-value problems.
Investigation and diagnosis - Identify the underlying contributing causes of the problem and the best course of action for remediation.
Create a known error record - In ITIL, a known error is “a problem that has a documented root cause and a workaround.” Recording this information leads to less downtime if the problem triggers an incident. Known error records are typically stored in a known error database (KEDB).
Create a workaround, if necessary - A workaround is a temporary solution for reducing the impact of problems and keeping them from becoming incidents. These aren’t ideal, but they can limit the business impact and avoid a customer-facing incident if the problem can’t be easily identified and eliminated.
Resolve and close the problem - A closed problem is one that has been eliminated and can no longer cause another incident.
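The steps above can be sketched as a small data model. This is a minimal illustration, not the schema of any particular ITSM tool: the class, field, and status names are assumptions, and the incident key "INC-101" is a made-up example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    DETECTED = "detected"
    PRIORITIZED = "prioritized"
    INVESTIGATING = "investigating"
    KNOWN_ERROR = "known_error"
    CLOSED = "closed"


@dataclass
class Problem:
    summary: str
    status: Status = Status.DETECTED
    priority: Optional[int] = None  # 1 = highest (hypothetical scale)
    contributing_causes: List[str] = field(default_factory=list)
    workaround: Optional[str] = None
    linked_incidents: List[str] = field(default_factory=list)

    def prioritize(self, priority: int) -> None:
        self.priority = priority
        self.status = Status.PRIORITIZED

    def investigate(self, causes: List[str]) -> None:
        self.contributing_causes.extend(causes)
        self.status = Status.INVESTIGATING

    def record_known_error(self, workaround: str) -> None:
        # A known error needs a documented cause and a workaround.
        if not self.contributing_causes:
            raise ValueError("document at least one contributing cause first")
        self.workaround = workaround
        self.status = Status.KNOWN_ERROR

    def close(self) -> None:
        self.status = Status.CLOSED


# Walk one hypothetical problem through the steps.
problem = Problem("Config file rewritten during deploy", linked_incidents=["INC-101"])
problem.prioritize(1)
problem.investigate(["Deploy script lacks a file lock"])
problem.record_known_error("Restore config from last known good copy")
problem.close()
print(problem.status.name)
```

Keeping the linked incidents on the problem record is one way to preserve the connection between an incident and its underlying causes after the incident itself is resolved.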
Problem management best practices and tips
As we mentioned earlier, the most effective problem management teams we’ve seen blend problem management and incident management.
Setting up problem management as a separate practice creates a risk that the problem team becomes a bottleneck or focuses on the wrong things, like problems from external vendors that they have no control over. Root causes are often not investigated until long after the incident has happened.
In many cases, your team may benefit from integrating incident management and problem management practices. This is a proactive approach that allows you to understand what led to the incident at the same time you work to resolve it. For example, resolving an incident in software requires identifying poor code (the cause), and then developing replacement code to avoid further incidents (the fix).
Weaving problems and incidents together means that when teams aren’t in response mode, they can turn to the problems that most affect service and performance quality, and get ahead of future incidents.
Problem management tips
Avoid relying on reactive, root-cause analysis
There is rarely just one root cause behind an incident or problem. The best teams holistically consider all potential contributing factors and practice blameless analysis.
Encourage an open environment where problems are shared
Problem and incident analysis should be an open conversation where team members are encouraged to share the facts without fear of punishment or retribution.
Focus on critical services
Prioritize addressing the problems affecting the services that deliver the most value to the organization.
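One way to make this concrete is to rank open problems by the value of the affected service and how often the problem has caused incidents. The sketch below is an illustrative heuristic only: the service names, value weights, and scoring formula are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OpenProblem:
    name: str
    service: str
    incidents_last_quarter: int


# Hypothetical business-value weights per service (higher = more critical).
SERVICE_VALUE = {"checkout": 10, "search": 6, "internal-wiki": 2}


def priority_score(problem: OpenProblem) -> int:
    """Score = service value x recent incident count (an illustrative heuristic)."""
    return SERVICE_VALUE.get(problem.service, 1) * problem.incidents_last_quarter


problems = [
    OpenProblem("Stale cache entries", "search", 4),
    OpenProblem("Payment timeout", "checkout", 3),
    OpenProblem("Broken page links", "internal-wiki", 9),
]

# Highest score first: the problems hurting the most valuable services.
ranked = sorted(problems, key=priority_score, reverse=True)
for p in ranked:
    print(priority_score(p), p.name)
```

With these made-up weights, the checkout problem ranks first even though the wiki problem caused more incidents, which is the point of weighting by service value.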
Ask questions and use the “5 Whys”
Many teams find success using the “5 Whys” technique popularized by Taiichi Ohno, the architect of the Toyota Production System. Check out the Atlassian Team Playbook play to learn more.
Spread the knowledge
Open teams share knowledge and insights that their colleagues and adjacent teams can learn from.
Become a learning organization
Effective problem management isn’t something with an end date. Even the best-performing organizations have incidents. The true world-class teams are the ones who constantly iterate on their process, improve it, and lessen the impact of problems on their colleagues and customers.
It’s important to develop a clear and standardized way to stay on top of follow-up actions. Since you should always be practicing problem management, it’s important to use ITSM software that will enable your team to prioritize tasks, track progress, and help associate incident issues with problems.
Incidents are often described as an unplanned investment in the future reliability of your service. Effective problem management delivers valuable service improvements while identifying and eliminating the driving forces behind incidents.