How to Create an Effective Outage/Alert Communication Process
Service providers need to communicate effectively with users whenever scheduled maintenance is planned or a service fails. If delays are expected, it is important to provide an estimated resolution time so that staff and customers can plan their time accordingly. A reliable estimate is usually possible only when the outage is planned. Status updates are also required to keep the people most affected by a failure informed. The right business stakeholders and the correct technical teams should be included in the email communication.
Benefits of sending communication:
Sending updates reduces unexpected spikes in incident and call volume. For example, suppose a heavily used application or service is not working and customers have not been sent any notification. Just imagine how many incidents will be created in such circumstances: far too many to control.
Technical teams should also be made aware of the service level agreements (SLAs) between the service provider and the customer, so that they can recover from the outage situation quickly.
Multiple teams can be involved in fixing an issue with an application or service. Other teams may then send notifications to almost the same list of users for the same issue, which can create confusion. This happened in my organization: when I analyzed six months of notifications, I found that another team had sent the Outage/Alert notification 8 times for the same issue as my team. In such cases, it is advisable to have a single source of communication, coordinated through the Incident Management team, to avoid duplication.
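As a rough illustration of the single-source idea, the sketch below keeps one shared log of notifications and rejects an update that has already been sent for the same incident. The class name, the incident IDs and the team names are hypothetical and are not taken from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationLog:
    """Hypothetical shared log acting as the single source of communication."""

    # Maps (incident_id, update_number) to the team that sent that update.
    sent: dict = field(default_factory=dict)

    def try_send(self, incident_id: str, update_number: int, team: str) -> bool:
        """Record the update and return True, or return False for a duplicate."""
        key = (incident_id, update_number)
        if key in self.sent:
            return False  # this update was already sent by some team
        self.sent[key] = team
        return True


log = NotificationLog()
first = log.try_send("INC-001", 1, "Incident Management")
duplicate = log.try_send("INC-001", 1, "Application Support")
follow_up = log.try_send("INC-001", 2, "Incident Management")
print(first, duplicate, follow_up)
```

In this sketch the second team's attempt is refused because the first update for the incident is already on record, while a genuine follow-up update is still allowed.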
Periodic updates should be concise and contain only what the stakeholders need. They can be sent with any communication tool, but the format should be mobile friendly because most management users read them on their mobile devices. The final communication can include more technical detail. Ideally, it should contain the subject of the notification, day and date, time started, time ended, impact (sites and services affected), actions taken, actions required, next steps, contact information, the resolution and, if known, the root cause. The last item may include workaround details or problem record details for root cause analysis (RCA).
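The fields listed above could be captured in a simple template. The following is a minimal sketch, assuming a plain-text body with one short line per field so that it reads well on a mobile screen. The class, the field names and the sample values are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutageNotification:
    """Hypothetical template holding the fields of an outage notification."""

    subject: str
    date: str
    time_started: str
    impact: str
    actions_taken: str
    actions_required: str
    next_steps: str
    contact: str
    time_ended: Optional[str] = None   # unknown while the outage is ongoing
    resolution: Optional[str] = None   # filled in for the final communication
    root_cause: Optional[str] = None   # included only if known

    def render(self) -> str:
        """Return one 'Label: value' line per field, skipping unknown fields."""
        rows = [
            ("Subject", self.subject),
            ("Date", self.date),
            ("Time started", self.time_started),
            ("Time ended", self.time_ended),
            ("Impact", self.impact),
            ("Actions taken", self.actions_taken),
            ("Actions required", self.actions_required),
            ("Next steps", self.next_steps),
            ("Contact", self.contact),
            ("Resolution", self.resolution),
            ("Root cause", self.root_cause),
        ]
        return "\n".join(f"{label}: {value}" for label, value in rows if value)


# Sample values are made up for illustration.
notice = OutageNotification(
    subject="Web portal unavailable",
    date="Monday, 1 January",
    time_started="09:00",
    impact="Site A, web portal",
    actions_taken="Server restarted",
    actions_required="None",
    next_steps="Next update in 30 minutes",
    contact="Service Desk",
)
text = notice.render()
print(text)
```

Because unknown fields are skipped, the same template can serve both the short periodic updates and the fuller final communication.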
Identifying and involving stakeholders is very important. I think it is best when all IT teams that perform Service Operation tasks are represented in the project. They can identify the problems with recent outage/alert communication and collect the requirements for improvement. If not all relevant stakeholders are involved, they may not accept the project outcome. Management involvement is also necessary, to manage and approve resources for the project. The requirements of the business are important as well. If you have business relationship managers, they can collect those requirements.
One can build a communication matrix that shows who should be informed about which outage. It is important to define the roles and responsibilities: who does what, and when. A new forum for planned outages can also be established, for example a newsletter through which all business stakeholders receive information about planned IT outages. Templates can also be created for regular scheduled maintenance outages.
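A communication matrix can be as simple as a lookup table from each service to the groups that must be informed when it is affected. The sketch below assumes made-up service and group names; a real matrix would come out of the stakeholder analysis.

```python
# Hypothetical matrix: service name -> groups to inform about its outages.
COMMUNICATION_MATRIX = {
    "email": ["service-desk", "business-managers"],
    "web-portal": ["service-desk", "web-team", "customers"],
}


def recipients_for(affected_services):
    """Return the sorted, de-duplicated groups to notify for an outage."""
    groups = set()
    for service in affected_services:
        # Services missing from the matrix contribute no recipients.
        groups.update(COMMUNICATION_MATRIX.get(service, []))
    return sorted(groups)


recipients = recipients_for(["email", "web-portal"])
print(recipients)
```

Collecting the groups in a set means a group that appears under several affected services receives only one notification.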
For each group of similar outages, examine the reasons for the duration of the unavailability. For example, the outage may have occurred because of faulty hardware or software, but the unavailability might have been prolonged by a lack of tools, training, spares, etc.
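One possible way to carry out this review is to group outage records by cause and total their durations, noting what prolonged each one. The sketch below uses invented records and field names purely for illustration.

```python
from collections import defaultdict

# Invented sample records; "delay_reason" is what prolonged the outage, if known.
outages = [
    {"cause": "hardware", "minutes": 120, "delay_reason": "no spares"},
    {"cause": "hardware", "minutes": 45, "delay_reason": None},
    {"cause": "software", "minutes": 90, "delay_reason": "lack of training"},
]


def summarize(records):
    """Group outages by cause, totalling duration and collecting delay reasons."""
    summary = defaultdict(
        lambda: {"count": 0, "total_minutes": 0, "delay_reasons": []}
    )
    for record in records:
        group = summary[record["cause"]]
        group["count"] += 1
        group["total_minutes"] += record["minutes"]
        if record["delay_reason"]:
            group["delay_reasons"].append(record["delay_reason"])
    return dict(summary)


report = summarize(outages)
for cause, stats in report.items():
    print(cause, stats)
```

A summary like this makes it easier to see whether the time lost in a group of outages was driven by the original fault or by missing tools, training or spares.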
One should consider the three “P’s” (People, Product and Process) and review the existing procedures and support policies that were invoked or used during the outage, as well as the actions (or inactions) of the people involved in the outage or restoration.
One should determine whether anything might have shortened the outage or avoided it altogether. This can be done by creating a problem ticket and finding the root cause of the issue. An example might be the lack of a tool, a process or a similar item related to the issue. An Exception event means that a service or device is currently operating below its predefined normal parameters or indicators. The business is therefore impacted, and the device or service shows a failure, performance degradation or loss of functionality (for example, a web server down or CS coverage lost for several sites). A Major Incident should be created so that all actions taken are recorded properly and a problem ticket can be raised. Once the cause is known, it can be fixed permanently to prevent the issue from recurring.
Author : SiddharthPareek