@work Toolbox

How I manage operational risks – Maintenance level versus risk


In my regional role at work, supporting several different business units, I do not own any budget, and therefore I do not own the risk decision when it comes to renewing maintenance or refreshing telecom devices. However, my biggest challenge is making sure the budget owner clearly understands the potential risk behind the final decision.
I split the IT device lifecycle into two activities: « Operating » and « Keep Environment Current ».
I will cover the first one in this post and the second in a later one.

Operating activities cover the day-to-day management of IT infrastructure components, making sure they run and deliver service as expected. To do so, we sometimes need to apply a bug-fix update, start an RMA process, or escalate to the vendor's technical experts to fix an issue.
Most of the time, such services are available through an active maintenance contract, which needs to be renewed. Maintenance contracts are like insurance: they are optional, since the devices run without them, but they can be very helpful when a major issue occurs.

This is where you decide where to set your risk threshold, and our job as technologists is to translate technology impacts and consequences into risk language that the business can understand and reuse.

In my current company we have a lot of subject matter experts in the technologies we use: routing/switching, security, proxy, WAN optimization, VoIP, etc. For that reason, we don’t really need level 1 or level 2 vendor support.
Our environment does not change much, and we have long-standing sites deployed with stable configurations and OS versions. We have rarely had to do a bug-fix upgrade.
However, hardware replacement is still something we need to cover, since keeping a « one for one » spare for every device model quickly becomes unsustainable from a cost point of view.

Since some of my internal customers are not that technical, I try to communicate as much as possible with templates and graphics, so that decisions and discussions about technology become no-brainers. To that end, I mapped risk to a number to make the final assessment and the budget owner's decision easier.

My risk score comes from a simple matrix that crosses, for each technology, its role (which drives service impact/downtime) with its maintenance level (which drives time to repair). For instance, a faulty WAN optimization appliance (which can be bypassed) has less service impact than a faulty WAN edge router, so the risk is different.
The matrix below covers only network technologies and should be adjusted to your infrastructure design and business requirements:
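
For readers who prefer code to pictures, here is a minimal sketch of how such a matrix could be encoded. Everything in it is an assumption made for illustration: the technology names, the maintenance level names, the 1 to 3 scales, and the choice of multiplying the two factors are hypothetical and are not the values from my own matrix.

```python
# Hypothetical scales, for illustration only; adjust to your own design.

# Role of the technology -> service impact if the device fails (1 = low, 3 = high).
SERVICE_IMPACT = {
    "wan_optimization": 1,  # can be bypassed, so low impact
    "access_switch": 2,
    "wan_edge_router": 3,   # site is cut off if it fails
}

# Maintenance level -> time-to-repair weight (1 = fast repair, 3 = slow repair).
TIME_TO_REPAIR = {
    "4h_hardware_replacement": 1,
    "next_business_day": 2,
    "no_maintenance": 3,
}


def risk_score(technology: str, maintenance_level: str) -> int:
    """Combine service impact and time to repair into one number.

    Multiplication is an assumed scoring rule; a lookup table or a sum
    would work just as well.
    """
    return SERVICE_IMPACT[technology] * TIME_TO_REPAIR[maintenance_level]


def risk_matrix() -> dict:
    """Build the full technology x maintenance level matrix."""
    return {
        tech: {level: risk_score(tech, level) for level in TIME_TO_REPAIR}
        for tech in SERVICE_IMPACT
    }


if __name__ == "__main__":
    for tech, row in risk_matrix().items():
        print(tech, row)
```

With these made-up scales, the WAN optimization appliance always scores lower than the WAN edge router at the same maintenance level, which is the behaviour described above.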


Here is an example of the recommended maintenance level per network technology, with the associated potential risk:


So far, using such visual templates and scoring risk has saved me time: it simplifies discussions by limiting them to how much risk the budget owner is willing to take.

And you, how do you deal with this kind of situation?