I recently gave a presentation at the TFT12 event on techniques for advanced incident management. One of the important tools is the incident model. During the presentation, a listener tweeted to ask whether the number of different incident types, and hence models, would get out of hand. Since Twitter is not an adequate medium for answering that excellent question, I will attempt to respond here.
There are three distinct issues to be addressed:
- What makes for a distinct incident model?
- Under what circumstances should a type of incident be modeled?
- How can we find a specific model if there is a large number of them?
I will conclude with a brief discussion of maturity levels in the use of incident models.
What makes an incident model distinct?
An incident model serves no purpose unless it can be used for multiple incidents that are likely to occur in the future. If every incident is treated as a unique event, sui generis, clearly a model will not help. So, the question is how to group different incidents in a useful way so that a single model may be useful for a variety of incidents.
Recall, first, the types of information we expect to find in a model:
- what symptoms need to be collected
- how to classify the incident in the incident log
- to whom the incident should be assigned, if functional escalation is necessary
- how to resolve the incident
- how to restore the impacted services to their normal state.
If we set aside the last point—information on how to restore the impacted services—the basic grouping principle will be how to resolve the incident.
For example, many incidents concerning a computer may be resolved by re-booting the computer. The reason why the computer needs re-booting is not the concern of incident management. That issue is handled by problem management. It suffices to collect enough information about the symptoms to know that the resolution is via re-booting. This same information also determines how to assign the incident, if required. I emphasize "if required" because a reliable and well-thought-out system may very well allow for automated re-booting, without human intervention. A sketch of an incident model maturity model may be found below.
Another example might be the resolution of an incident due to an application bug. These incidents may be resolved either via a work-around, in which case the developers are not necessarily involved, or they may be resolved by fixing the bug and releasing a patch. From the perspective of the incident model, it suffices to identify that no work-around is possible and the intervention of developers is required. Of course, each different bug will have its own potential impact, but the resolution—getting a developer to fix the bug and releasing a patch—would be the same. This example also shows why an incident model can only suggest a default impact. The actual impact of an incident might vary from case to case and would need to be tuned.
The issue of how to restore normal service is more complicated. Taking as an example an incident that results in the failure of the job scheduler, the precise steps to take to restore services will depend on which jobs have failed and which jobs have remained in the queue. This will depend entirely on the day and the time of the failure, as well as the duration of the resolution activities. Consequently, certain models cannot give precise information about how to restore the impacted services.
A more sophisticated model will include rule-, data- or state-driven logic. The steps for restoring a service may depend as much on the service itself as on the incident that has disrupted it. Therefore, the model might include links to information specific to the disrupted services, rather than hard-coding the steps. This approach greatly simplifies the need for different models, allowing one model to be used for a broad array of different services.
Similarly, the assignment of the incident or even the resolution steps might be determined by the specific component that has failed. To use the example given above, the assignment for re-booting a Windows server might be different from the assignment for re-booting a UNIX server. As long as we can distinguish the component impacted, the same incident model could be used for both types of machines, even though the responsibilities are different.
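To make the idea concrete, here is a minimal sketch of how one model could serve several component types and services. It is only an illustration: the class name, field names, team names and the knowledge base link are all hypothetical, and a real incident management tool would have its own record structure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IncidentModel:
    """Hypothetical layout of an incident model; all field names are assumptions."""
    title: str
    symptoms_to_collect: list[str]
    classification: str
    resolution: str
    default_impact: str  # only a default; the actual impact is tuned per incident
    # Assignment is data-driven: it depends on the component that failed.
    assignment_by_component: dict[str, str] = field(default_factory=dict)
    # Restoration steps are linked per service rather than hard-coded in the model.
    restore_link_by_service: dict[str, str] = field(default_factory=dict)

    def assignee_for(self, component_type: str) -> Optional[str]:
        """Return the support group for a component type, or None if none is defined."""
        return self.assignment_by_component.get(component_type)

    def restore_link_for(self, service: str) -> Optional[str]:
        """Return the link to service-specific restoration steps, if any."""
        return self.restore_link_by_service.get(service)


# One model covers both Windows and UNIX servers (illustrative values only).
reboot = IncidentModel(
    title="Unresponsive server, resolved by re-boot",
    symptoms_to_collect=["host name", "time of last response", "error message"],
    classification="Server",
    resolution="Re-boot the server",
    default_impact="medium",
    assignment_by_component={
        "windows_server": "Windows administration team",
        "unix_server": "UNIX administration team",
    },
    restore_link_by_service={"job_scheduler": "kb/restore-job-scheduler"},
)

print(reboot.assignee_for("windows_server"))
print(reboot.assignee_for("unix_server"))
```

The resolution is stated once, while the assignment and the restoration information are looked up from data. Adding a new type of server or a new service means adding an entry, not creating a new model.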
Under what circumstances should a type of incident be modeled?
If we follow the suggestion of using an approach such as Failure Modes and Effects Analysis to support the development of incident models, we can benefit from the prioritization step in FMEA to decide whether or not to develop an incident model. In particular, the likelihood of the incident occurring and the likelihood of detection must be sufficiently high in order to justify the investment in creating and maintaining the model.
Certain incident types will be deemed too trivial to merit the development of a model, either because of their limited impact or because they are unlikely to occur. Finally, the resources of each organization are finite, making it impossible to develop all the models that might, in theory, be of interest. All these factors will serve to limit the number of incident models.
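The selection logic described above might be sketched as follows. This is not a standard FMEA calculation; the rating scales, the thresholds, the ranking by occurrence times impact and the budget are all assumptions chosen for illustration, and each organization would set its own.

```python
from dataclasses import dataclass


@dataclass
class CandidateIncidentType:
    name: str
    occurrence: float  # estimated likelihood of occurring, 0.0 to 1.0 (assumed scale)
    detection: float   # estimated likelihood of detection, 0.0 to 1.0 (assumed scale)
    impact: int        # estimated impact, 1 (trivial) to 5 (severe) (assumed scale)


def select_types_to_model(candidates, min_occurrence=0.2, min_detection=0.5,
                          min_impact=2, budget=2):
    """Keep candidates that pass every threshold, then take the top `budget`."""
    worthwhile = [
        c for c in candidates
        if c.occurrence >= min_occurrence
        and c.detection >= min_detection
        and c.impact >= min_impact
    ]
    # Rank by a simple priority figure; finite resources cap the list.
    worthwhile.sort(key=lambda c: c.occurrence * c.impact, reverse=True)
    return worthwhile[:budget]


candidates = [
    CandidateIncidentType("Server hang resolved by re-boot", 0.6, 0.9, 3),
    CandidateIncidentType("Job scheduler failure", 0.3, 0.8, 5),
    CandidateIncidentType("Typo in report footer", 0.7, 0.9, 1),      # too trivial
    CandidateIncidentType("Data centre flood", 0.01, 1.0, 5),         # too unlikely
    CandidateIncidentType("Application bug needing patch", 0.4, 0.7, 3),
]

for chosen in select_types_to_model(candidates):
    print(chosen.name)
```

With these illustrative figures, the trivial and the very unlikely incident types are filtered out, and the budget of two then leaves the application bug unmodeled for now, even though it passed the thresholds.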
As a corollary, we should also consider that incident models have a certain lifetime during which some maintenance may be required. At some point, a model should be retired or merged with another model, insofar as the likelihood of its use approaches zero.
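One simple way to spot models whose likelihood of use is approaching zero is to look at when each was last used. The sketch below assumes the tool can report a last-used date per model; the function name, the one-year idle period and the sample data are hypothetical.

```python
from datetime import date, timedelta


def retirement_candidates(last_used, today, max_idle_days=365):
    """Return titles of models not used within the idle period (or never used)."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        title for title, used_on in last_used.items()
        if used_on is None or used_on < cutoff
    )


# Illustrative data: model title -> date of last use (None means never used).
last_used = {
    "Re-boot server": date(2023, 12, 15),
    "Legacy fax gateway failure": date(2021, 3, 2),
    "Job scheduler failure": date(2023, 6, 30),
    "Draft model, never applied": None,
}

print(retirement_candidates(last_used, today=date(2024, 1, 1)))
```

The output is only a list of candidates for review. Whether a model is retired, merged or kept remains a human decision.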
Finding the right incident model
There might be, in theory, a way to design an ontology of incident models and use that ontology as a means for easily finding the right model. But it is much more likely that models will be developed pragmatically, based on specific issues as they arise. The result of this bottom-up approach will be an unstructured list of incident models. A model will be of no use unless it is easily and accurately found, during the early stages of handling an incident. What can be done to facilitate finding models?
In practice, a model is generally implemented in an incident management tool as a template for an incident record. As the support person works through the initial steps of creating the incident record, at some point a decision may be made to search for a model. If, at that point, the support person is simply presented with a list of models, sorted alphabetically by their titles, and if that list is long, the likelihood of finding the right model is considerably reduced. What can be done to make it easier to find the right model?
In fact, this issue is fundamentally a knowledge management issue. The same techniques that are used to find the right knowledge item may be used here to find the right incident model. However, to benefit from these techniques, the tool in use has to include the relevant functionality. If the tool provides nothing more than a sorted drop-down list, finding the right model might be difficult, if that list is long.
There are various knowledge management techniques in use. One technique is to organize the models hierarchically. This technique is probably useless, or costs more to implement and maintain than the benefits one might get from it. A second technique is to tag the model with certain key words. These may be provided manually or may be determined automatically on the basis of the content of the model itself. Indexing the full text of the fields of the model is probably very useful. Furthermore, using the data already known about the incident may help to identify the correct model to use. This data typically would include any error messages by which the incident was detected; the category of CI on which the incident was detected; the identity and organizational unit of the user raising the incident; etc. A more sophisticated tool would use a Bayesian approach to learn from past use of the models in order to refine the retrieval criteria and propose a short list of the few models that are most likely to apply to a given incident.
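As a rough illustration of the Bayesian approach, the sketch below uses a simplified naive Bayes scheme: it records which incident features (error message, CI category and so on) accompanied each past use of a model, then ranks the models for a new incident. The class, the feature labels and the model titles are all hypothetical, and the scoring considers only features that are present, which a production tool would likely refine.

```python
import math
from collections import Counter, defaultdict


class ModelRanker:
    """Simplified naive Bayes ranking of incident models (illustrative only)."""

    def __init__(self):
        self.uses = Counter()                       # model title -> past uses
        self.feature_counts = defaultdict(Counter)  # model title -> feature -> count
        self.vocabulary = set()                     # every feature seen so far

    def record_use(self, model, features):
        """Learn from one past incident that was handled with `model`."""
        self.uses[model] += 1
        for feature in set(features):
            self.feature_counts[model][feature] += 1
            self.vocabulary.add(feature)

    def rank(self, features, top=3):
        """Return up to `top` model titles, most likely first."""
        total = sum(self.uses.values())
        observed = set(features) & self.vocabulary  # ignore unknown features
        scores = {}
        for model, n in self.uses.items():
            log_p = math.log(n / total)  # prior: how often the model was used
            for feature in observed:
                seen = self.feature_counts[model][feature]
                log_p += math.log((seen + 1) / (n + 2))  # Laplace smoothing
            scores[model] = log_p
        return sorted(scores, key=scores.get, reverse=True)[:top]


ranker = ModelRanker()
ranker.record_use("Re-boot server", ["ci:server", "error:timeout"])
ranker.record_use("Re-boot server", ["ci:server", "error:no response"])
ranker.record_use("Application bug, patch required", ["ci:application", "error:exception"])
ranker.record_use("Job scheduler failure", ["ci:scheduler", "error:timeout"])

print(ranker.rank(["ci:server", "error:timeout"]))
```

Instead of a long alphabetical list, the support person sees a handful of proposals, and each confirmed choice fed back through `record_use` refines the next ranking.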
A maturity model for the use of incident models
From the highest perspective, I see three levels of maturity:
- No use of models
- Manual use of models
- Use of models to automate event and incident handling
Within levels 2 and 3, the most significant differences in maturity concern how extensively models are used and how well they are maintained throughout their lifetimes. It is common for organizations to use incident templates for the types of calls that the service desk receives very frequently. As the organization matures, it will extend the use of these templates to other functional units and types of incidents. But the work does not stop with the creation of the model. As the infrastructure, services and organization evolve, these models must be maintained. They may be adapted according to changing circumstances. They may be retired when no longer of any use. They may be merged with other models as it becomes evident that different models can be simplified. And there may be cases where a single model needs to be split into two or more different models.
When a model first comes into use, it may be necessary to test it and confirm its accuracy, applicability and completeness. Manual intervention might be required to adapt the values proposed in the model. However, as the model becomes better tuned and more reliable, it may become possible to fully automate the selection of the model and the implementation of the incident resolution. As such, incident models may be an important step towards the creation of self-healing systems.