Problem management is the process responsible for managing the lifecycle of all problems. ITIL defines a ‘problem’ as the underlying cause of one or more incidents. The purpose of problem management is to manage each problem from first identification through investigation and documentation to eventual removal.
Problem management seeks to minimize the adverse impact on the business of incidents and problems caused by underlying errors within the IT infrastructure, and to proactively prevent the recurrence of incidents related to these errors. To achieve this, problem management seeks to identify the root cause of incidents, document and communicate known errors, and initiate actions to improve or correct the situation.
Objectives & Steps
The objectives of the problem management process are to:
Prevent problems and resulting incidents from happening
Eliminate recurring incidents
Minimize the impact of incidents that cannot be prevented
Problem management follows the steps below:
Problem detection: It is likely that multiple ways of detecting problems will exist in all organizations. These can include the following triggers for reactive and proactive problem management:
Suspicion or detection of a cause of one or more incidents by the service desk, resulting in a problem record being raised. The desk may have resolved the incident without determining a definitive cause and may suspect that it is likely to recur, so it will raise a problem record to allow the underlying cause to be resolved. Alternatively, it may be obvious from the outset that an incident, or incidents, has been caused by a major problem, so a problem record will be raised without delay.
Analysis of an incident by a technical support group which reveals that an underlying problem exists, or is likely to exist. A problem record is then raised so that the underlying fault can be investigated further.
Trending of historical incident records to identify one or more underlying causes that, if removed, can prevent their recurrence. In this case, a problem record is raised once the underlying trend or cause is discovered.
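As a minimal sketch of the trending trigger described above, the following Python groups historical incident records by category and configuration item and flags any group that recurs often enough to justify raising a problem record. The record fields (`category`, `ci`), the function name `find_problem_candidates` and the threshold of 3 are illustrative assumptions, not values defined by ITIL.

```python
from collections import Counter

def find_problem_candidates(incidents, threshold=3):
    """Return (category, ci) pairs seen in at least `threshold` incidents.

    `incidents` is a list of dicts with the hypothetical keys
    'category' and 'ci' (configuration item).
    """
    counts = Counter((i["category"], i["ci"]) for i in incidents)
    # Sort so that the output is stable and repeatable.
    return sorted(key for key, n in counts.items() if n >= threshold)

# Hypothetical incident history, used only for illustration.
history = [
    {"id": "INC-1", "category": "email", "ci": "mail-server-01"},
    {"id": "INC-2", "category": "email", "ci": "mail-server-01"},
    {"id": "INC-3", "category": "network", "ci": "switch-07"},
    {"id": "INC-4", "category": "email", "ci": "mail-server-01"},
]

candidates = find_problem_candidates(history)
print(candidates)
```

In practice the grouping key and threshold would depend on how the organization categorizes incidents; a time window would usually be added as well.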
Problem logging: A cross-reference must be made to the incident(s) that initiated the problem record, and all relevant details must be copied from the incident record(s) to the problem record. It is difficult to be exact, as cases may vary, but typically this will include details such as:
User details
Service details
Equipment details
Date/time initially logged
Priority and categorization details
Incident description
Incident record numbers or other cross-reference
Details of all diagnostic or attempted recovery actions taken
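The logging step above can be sketched as a record type that copies the listed details from the originating incidents and keeps a cross-reference to each of them. All class, field and function names here are hypothetical, and the rule that a lower number means a higher priority is an assumption made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class IncidentRecord:
    number: str
    user: str
    service: str
    equipment: str
    logged_at: datetime
    priority: int
    category: str
    description: str
    actions_taken: list = field(default_factory=list)

@dataclass
class ProblemRecord:
    number: str
    incident_numbers: list   # cross-references to the originating incidents
    users: list
    service: str
    equipment: str
    first_logged_at: datetime
    priority: int
    category: str
    description: str
    actions_taken: list

def log_problem(number, incidents):
    """Create a problem record from one or more related incidents."""
    if not incidents:
        raise ValueError("a problem record needs at least one incident")
    first = min(incidents, key=lambda i: i.logged_at)
    return ProblemRecord(
        number=number,
        incident_numbers=[i.number for i in incidents],
        users=sorted({i.user for i in incidents}),
        service=first.service,
        equipment=first.equipment,
        first_logged_at=first.logged_at,
        # Assumption: lower number = higher priority; keep the most urgent.
        priority=min(i.priority for i in incidents),
        category=first.category,
        description=first.description,
        actions_taken=[a for i in incidents for a in i.actions_taken],
    )

# Hypothetical incidents, used only for illustration.
inc_a = IncidentRecord("INC-10", "alice", "billing", "app-server-02",
                       datetime(2024, 1, 5, 9, 0), 2, "application",
                       "Billing run failed", ["restarted job"])
inc_b = IncidentRecord("INC-14", "bob", "billing", "app-server-02",
                       datetime(2024, 1, 6, 9, 0), 1, "application",
                       "Billing run failed again", ["edited input file"])

problem = log_problem("PRB-1", [inc_b, inc_a])
print(problem.incident_numbers, problem.priority)
```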
Problems should be categorized in the same way as incidents (and it is advisable to use the same coding system), so that the true nature of the problem can be easily traced in the future, and meaningful management information can be obtained. This also allows for incidents and problems to be more readily matched.
Problems should be prioritized in the same way, and for the same reasons, as incidents. The frequency and impact of related incidents must also be taken into account, as must the severity of the problem.
Severity in this context refers to how serious the problem is from a service or customer perspective, as well as from an infrastructure perspective. For example:
Can the system be recovered, or does it need to be replaced?
How much will it cost?
How many people, with what skills, will be needed to fix the problem?
How long will it take to fix the problem?
How extensive is the problem (e.g. how many CIs are affected)?
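One way to make the prioritization rule concrete is a small scoring function. The sketch below is an assumption for illustration only: it derives a base priority from impact and urgency, as is commonly done for incidents, and raises the priority by one level when the number of related incidents reaches a threshold. The scale, the threshold and the function name `problem_priority` are hypothetical.

```python
def problem_priority(impact, urgency, related_incidents, frequency_threshold=5):
    """Return a priority from 1 (highest) to 5 (lowest).

    impact and urgency are 1 (high), 2 (medium) or 3 (low).
    """
    if impact not in (1, 2, 3) or urgency not in (1, 2, 3):
        raise ValueError("impact and urgency must be 1, 2 or 3")
    base = impact + urgency - 1          # ranges from 1 to 5
    # Assumption: a frequently recurring problem is raised by one level.
    if related_incidents >= frequency_threshold:
        base -= 1
    return max(1, base)

print(problem_priority(impact=2, urgency=2, related_incidents=10))
```

Severity factors such as cost or the number of affected CIs could be added as further inputs in the same way.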
Problem investigation and diagnosis:
At this stage, an investigation should be conducted to diagnose the root cause of the problem. The speed and nature of this investigation will vary depending upon the impact, severity and urgency of the problem. The CMS must be used to determine the level of impact and pinpoint and diagnose the exact point of failure.
The known error database (KEDB) should also be accessed, and problem-matching techniques (such as keyword searches) should be used to see whether the problem has occurred before and, if so, to find the resolution.
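A keyword search of the kind mentioned above can be sketched as follows. The KEDB is modelled here as a plain list of dicts with hypothetical keys (`id`, `symptoms`, `workaround`); a real KEDB would normally be a feature of the service management tool, and the overlap threshold is an arbitrary assumption.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def match_known_errors(symptom, kedb, min_overlap=2):
    """Return ids of known errors sharing at least `min_overlap` keywords
    with the symptom description, best match first."""
    words = tokenize(symptom)
    scored = []
    for entry in kedb:
        overlap = len(words & tokenize(entry["symptoms"]))
        if overlap >= min_overlap:
            scored.append((overlap, entry["id"]))
    # Highest overlap first; ties are broken by id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [entry_id for _, entry_id in scored]

# Hypothetical KEDB entries, used only for illustration.
kedb = [
    {"id": "KE-1", "symptoms": "billing run fails on corrupted input file",
     "workaround": "manually amend the input file and rerun"},
    {"id": "KE-2", "symptoms": "email delivery delayed, mail queue full",
     "workaround": "flush the mail queue"},
]

matches = match_known_errors("Billing job fails: input file corrupted", kedb)
print(matches)
```

This sketch does no stop-word filtering or stemming, so common words can inflate the overlap count.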
It is often valuable to try to recreate the failure to understand what has gone wrong, and then try various ways of finding the most appropriate and cost-effective resolution to the problem. It may be possible to recreate the problem in a test environment that mirrors the live environment. This allows for investigation and diagnosis activities to proceed effectively without causing further disruption to users.
In some cases it may be possible to find a workaround to the incidents caused by the problem. For example, a manual amendment may be made to an input file to allow a program to complete its run successfully and allow a billing process to complete satisfactorily, but it is important that work on a permanent resolution continues where this is justified. In this example the reason for the file becoming corrupted in the first place must be found and corrected to prevent this happening again.
When a workaround is found, it is therefore important that the problem record remains open and details of the workaround are documented within the problem record.
Raising a known error record: A known error is defined as a problem with a documented root cause and workaround. The known error record should identify the problem record it relates to and document the status of actions being taken to resolve the problem, its root cause and workaround. All known error records should be stored in the KEDB.
As soon as the diagnosis is complete, and particularly where a workaround has been found (even though it may not yet be a permanent resolution), a known error record must be raised and placed in the KEDB so that if further incidents or problems arise, they can be identified and the service restored more quickly.
In some cases it may be advantageous to raise a known error record even earlier in the overall process, before the diagnosis is complete or a workaround has been found. This might be done for information purposes, or to record a suspected root cause or a workaround that appears to address the problem but has not been fully confirmed.
Once a root cause has been found and a solution to remove it has been developed, the solution should be applied to resolve the problem. In reality, safeguards may be needed to ensure that the resolution does not cause further difficulties. If any change in functionality is required, an RFC should be raised and authorized before the resolution can be applied. If the problem is very serious and an immediate fix is needed for business reasons, an emergency RFC should be raised. The resolution should be applied only when the change has been authorized and scheduled for release. In the meantime, the KEDB should be used to help quickly resolve any further occurrences of the related incidents or problems.
When a final resolution has been applied, the problem record should be formally closed, as should any related incident records that are still open. A check should be performed at this time to ensure that the record contains a full historical description of all events and if not, the record should be updated.
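The resolution and closure rules above can be sketched as two guarded operations on a problem record: a resolution that needs a change is refused until an RFC has been authorized, and closing the problem also closes any related incidents that are still open. The class, the status names and the exception `ChangeNotAuthorized` are hypothetical and serve only to illustrate the control flow.

```python
class ChangeNotAuthorized(Exception):
    """Raised when a resolution needs a change but no RFC is authorized."""

class Problem:
    def __init__(self, number, open_incidents):
        self.number = number
        self.status = "open"
        self.open_incidents = set(open_incidents)
        self.history = []            # historical description of all events

    def apply_resolution(self, description, needs_change, rfc_authorized=False):
        if self.status != "open":
            raise ValueError("problem is not open")
        if needs_change and not rfc_authorized:
            raise ChangeNotAuthorized("an authorized RFC is required first")
        self.history.append(f"resolution applied: {description}")
        self.status = "resolved"

    def close(self):
        if self.status != "resolved":
            raise ValueError("only a resolved problem can be closed")
        # Related incidents that are still open are closed with the problem.
        closed_incidents = sorted(self.open_incidents)
        self.open_incidents.clear()
        self.history.append(f"closed with incidents: {closed_incidents}")
        self.status = "closed"
        return closed_incidents

problem = Problem("PRB-1", ["INC-14", "INC-10"])
problem.apply_resolution("fix input file parser", needs_change=True,
                         rfc_authorized=True)
closed = problem.close()
print(problem.status, closed)
```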
Major problem review:
After every major problem (as determined by the organization’s priority system), and while memories are still fresh, a review should be conducted to learn any lessons for the future. Specifically, the review should examine:
Those things that were done correctly
Those things that were done wrong
What could be done better in the future
How to prevent recurrence
Whether there has been any third-party responsibility and whether follow-up actions are needed
Problem management includes the activities required to diagnose the root cause of incidents and to determine the resolution to those problems. It is also responsible for ensuring that the resolution is implemented through the appropriate control procedures, especially change management and release and deployment management. Problem management also maintains information about problems and the appropriate workarounds and resolutions, so that the organization is able to reduce the number and impact of incidents over time.
In this respect, problem management has a strong interface with knowledge management, and tools such as the KEDB will be used for both. Although incident and problem management are separate processes, they are closely related and will typically use the same tools, and may use similar categorization, impact and priority coding systems. This will ensure effective communication when dealing with related incidents and problems.
The problem management process has both reactive and proactive aspects:
Reactive problem management is concerned with solving problems in response to one or more incidents
Proactive problem management is concerned with identifying and solving problems and known errors before further incidents related to them can occur
While reactive problem management activities are performed in reaction to specific incident situations, proactive problem management activities take place as ongoing activities targeted at improving the overall availability of IT services and end-user satisfaction with them.
Examples of proactive problem management activities might include:
Conducting periodic scheduled reviews of incident records to find patterns and trends in reported symptoms that may indicate the presence of underlying errors in the infrastructure
Conducting major incident reviews, where reviewing ‘How can we prevent the recurrence?’ can identify an underlying cause or error
Conducting periodic scheduled reviews of operational logs and maintenance records to identify patterns and trends of activities that may indicate that an underlying problem exists
Conducting periodic scheduled reviews of event logs, targeting patterns and trends of warning and exception events that may indicate the presence of an underlying problem
Conducting brainstorming sessions to identify trends that could indicate the existence of underlying problems
Using check sheets to proactively collect data on service or operational quality issues that may help to detect underlying problems
Reactive and proactive problem management activities are generally conducted within the scope of service operation. A close relationship exists between proactive problem management activities and continual service improvement (CSI) lifecycle activities that directly support identifying and implementing service improvements.
Proactive problem management supports those activities through trend analysis and the targeting of preventive action. Problems identified through these activities become an input to the CSI register, which is used to record and manage improvement opportunities.
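As an illustration of the scheduled event-log reviews mentioned above, the sketch below counts warning and exception events per source and message and reports those that recur. The event fields (`level`, `source`, `message`), the function name `recurring_warnings` and the minimum count are assumptions made for the example; a real review would read events from the organization's monitoring tools.

```python
from collections import Counter

def recurring_warnings(events, min_count=3):
    """Count warning and exception events per (source, message) and return
    those seen at least `min_count` times, most frequent first."""
    counts = Counter(
        (e["source"], e["message"])
        for e in events
        if e["level"] in ("warning", "exception")
    )
    return [(key, n) for key, n in counts.most_common() if n >= min_count]

# Hypothetical event log, used only for illustration.
events = [
    {"level": "warning", "source": "db-01", "message": "disk usage above 90%"},
    {"level": "info", "source": "db-01", "message": "backup completed"},
    {"level": "warning", "source": "db-01", "message": "disk usage above 90%"},
    {"level": "exception", "source": "web-02", "message": "timeout calling db-01"},
    {"level": "warning", "source": "db-01", "message": "disk usage above 90%"},
]

patterns = recurring_warnings(events)
print(patterns)
```

Each reported pattern is a candidate for a proactively raised problem record rather than proof that a problem exists.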
Value to Business
Problem management provides value to business by providing:
Higher availability of IT services by reducing the number and duration of incidents that those services may incur. Problem management works together with incident management and change management to ensure that IT service availability and quality are increased. When incidents are resolved, information about the resolution is recorded. Over time, this information is used to speed up resolution and identify permanent solutions, reducing the number and resolution time of incidents.
Higher productivity of IT staff by reducing unplanned labour caused by incidents and creating the ability to resolve incidents more quickly through recorded known errors and workarounds
Reduced expenditure on workarounds or fixes that do not work
Reduction in the cost of effort spent fire-fighting or resolving repeat incidents