Monitoring Complex Systems

Thursday Jun 17th 2004 by George Spafford

Complex networks require increasingly sophisticated monitoring systems. But all too often, monitoring systems are treated as afterthoughts, leading to security breaches, accidents and inefficiencies.

Complex networks require increasingly sophisticated monitoring systems. However, far too often, monitoring is an afterthought and not a holistically engineered part of the system. In fact, it is very common that the overall monitoring system is complicated and mission-critical, yet has varying degrees of documentation, training, fault-tolerance and security.

In order to improve, organizations must recognize that a monitoring system can itself cause problems and that there is a unique set of issues that must be taken into account and mitigated.

Perceived Reliability

We must consider how people perceive the accuracy of automated feedback systems. A properly designed monitoring system must allow operators to realistically investigate every alert raised or issue flagged and to record the findings.

In other words, the system must be a closed loop in which issues are raised, investigated, mitigated (if need be) and the results logged. The problem is that as the number of erroneous alerts increases, the amount of personnel time wasted and the level of frustration increase as well.
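To make the closed loop concrete, here is a minimal sketch of an alert record that can only be closed once it has been investigated and logged. The class, state names and transition table are illustrative assumptions, not part of any particular monitoring product.

```python
from dataclasses import dataclass, field
from enum import Enum


class AlertState(Enum):
    RAISED = "raised"
    INVESTIGATED = "investigated"
    MITIGATED = "mitigated"
    LOGGED = "logged"


# Allowed transitions (hypothetical). Mitigation is optional ("if need be"),
# but an alert can never be logged without first being investigated.
ALLOWED = {
    AlertState.RAISED: {AlertState.INVESTIGATED},
    AlertState.INVESTIGATED: {AlertState.MITIGATED, AlertState.LOGGED},
    AlertState.MITIGATED: {AlertState.LOGGED},
    AlertState.LOGGED: set(),
}


@dataclass
class Alert:
    message: str
    state: AlertState = AlertState.RAISED
    history: list = field(default_factory=list)

    def advance(self, new_state: AlertState, note: str) -> None:
        """Move to the next state, recording who did what and why."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {new_state.value}"
            )
        self.history.append((self.state, new_state, note))
        self.state = new_state

    @property
    def closed(self) -> bool:
        return self.state is AlertState.LOGGED


def open_alerts(alerts):
    """Alerts that have not yet completed the loop."""
    return [a for a in alerts if not a.closed]
```

A steadily growing list from `open_alerts` would be one early sign that operators can no longer realistically investigate everything the system raises.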

This "perceived reliability" is a key dynamic for any form of monitoring. If operators have expectations that are out of alignment with what the system can deliver, then they are far more likely to discount reports coming from that system and even falsify reports in order to "not waste time."

Far too many accidents have taken place because operators assumed that messages were false positives when, in fact, the alerts were accurate. From this, we can posit The Law of False Alerts: As the rate of erroneous alerts increases, operator reliance on, or belief in, subsequent warnings decreases.

If a complex system has an area where there are constant false alarms coming from a monitoring system used to detect a security breach, or any critical parameter for that matter, wouldn't that be a prime target for a hacker or terrorist? Whether it is an intrusion detection system that constantly reports non-existent incursions, a flaky motion sensor flagging movement that doesn't exist, or an open/closed sensor providing a false report about a valve's state, if it is a known weak link due to media reports or even the office rumor mill, then it is at risk of allowing a breach to happen.

What do we do?

First, we must treat monitoring as an intrinsic part of the overall system in question. By adding monitoring to a system with little thought, we risk monitoring the wrong events and/or wrongly interpreting reported data. In other words, there must be a holistic approach that identifies key performance indicators in the system, their acceptable bounds and key causal logic. "If these sensors register X, Y and Z then event Alpha must be taking place and IT operations must be alerted immediately."
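The combination of indicators, acceptable bounds and causal logic might be sketched as follows. The indicator names, the bounds and the rule for event Alpha are invented for illustration; in practice they would come out of the engineering analysis described above.

```python
# Hypothetical key performance indicators and their acceptable bounds.
BOUNDS = {
    "cpu_load": (0.0, 0.85),
    "queue_depth": (0, 500),
    "error_rate": (0.0, 0.02),
}

# Hypothetical causal rule: event Alpha is assumed to be taking place
# only when all three indicators are out of bounds at the same time.
ALPHA_REQUIRES = {"cpu_load", "queue_depth", "error_rate"}


def out_of_bounds(readings):
    """Return the names of indicators whose reading is outside its bounds."""
    violations = set()
    for name, value in readings.items():
        low, high = BOUNDS[name]
        if not (low <= value <= high):
            violations.add(name)
    return violations


def evaluate(readings):
    """Map a set of readings to a message for the operators."""
    violations = out_of_bounds(readings)
    if ALPHA_REQUIRES <= violations:
        return "ALERT: event Alpha, notify IT operations immediately"
    if violations:
        return "WATCH: " + ", ".join(sorted(violations))
    return "OK"
```

Keeping the bounds and the rule in explicit tables, rather than scattered through code, makes them easier to review when the underlying system changes.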

The human factor must also be taken into account, with careful planning of which events trigger an alarm, the processes used to validate results, the layout of the messages and so on. Always bear in mind that as the level of false positives increases, faith in the monitoring system decreases. The monitoring system must not only be accurate, it must be viewed as accurate and as providing value to the operators, or they will increasingly ignore it over time, perhaps with disastrous results.

Second, build "monitoring in-depth." This is a play on "defense in-depth" in that multiple sensors are arranged to confirm events.

For example, one potential scenario is that a more sensitive but more error-prone sensor is used to initially indicate a state and a less sensitive but more reliable sensor is used in series to corroborate the earlier "fast alert" probe.
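A minimal sketch of this two-stage arrangement, assuming both sensors return a numeric reading and that a higher reading means the condition is more likely present. The function name, thresholds and return labels are hypothetical.

```python
def corroborated_alert(fast_reading, fast_threshold,
                       confirm_sensor, confirm_threshold):
    """Two-stage check.

    The fast, error-prone sensor acts as a trip wire. The slower, more
    reliable sensor (passed as a callable) is only polled once the fast
    sensor has tripped.
    """
    if fast_reading < fast_threshold:
        return "clear"
    if confirm_sensor() >= confirm_threshold:
        return "confirmed"
    # Tripped but not corroborated: still worth recording so that the
    # fast sensor's false alert rate can be tracked over time.
    return "unconfirmed"
```

For example, `corroborated_alert(0.9, 0.5, lambda: 0.2, 0.7)` yields `"unconfirmed"`, which would be recorded for later review rather than paged out as an emergency.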

Another scenario could involve an array of sensors used to confirm an event due to the critical need to be certain that the data collected is accurate. A single monitoring system is as susceptible to a single point-of-failure incident as any other system.
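One simple way to use an array of sensors is a quorum vote, sketched below. The quorum size, the threshold and the convention that a failed sensor reports `None` are assumptions made for the example.

```python
def quorum_confirms(readings, threshold, quorum):
    """Return True when at least `quorum` working sensors exceed `threshold`.

    A reading of None stands for a sensor that failed to report; it is
    ignored, so one dead sensor neither confirms nor blocks an event.
    """
    valid = [r for r in readings if r is not None]
    votes = sum(1 for r in valid if r >= threshold)
    return votes >= quorum
```

With three sensors and a quorum of two, the event is still confirmed if any one sensor fails, which addresses the single point-of-failure concern at the sensor level.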

Third, plan for continuous improvement. Odds are high that most of the underlying systems monitored will evolve over time for one reason or another. In parallel, the monitoring system must evolve to continue meeting expectations.

A monitoring system that can only handle 10 Mb/s will face a virtually impossible task if the underlying system is upgraded to gigabit speeds and it can't sample the data fast enough. Furthermore, these systems must be reviewed over time to ensure that they still align with operator requirements. For example, filters may need to be added or modified in order to screen out unnecessary "noise" that the operators are contending with that didn't initially exist (provided proper analysis is done to determine why the noise exists, of course).
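The size of the gap in the 10 Mb/s example can be estimated with simple arithmetic. The sketch below assumes a fully saturated link and takes "gigabit" to mean 1,000 Mb/s; real traffic is rarely saturated, so this is a worst-case figure.

```python
def max_observable_fraction(monitor_mbps, link_mbps):
    """Worst-case share of traffic a monitor can inspect on a saturated link."""
    return min(1.0, monitor_mbps / link_mbps)


# A 10 Mb/s monitor on a saturated 1,000 Mb/s link
fraction = max_observable_fraction(10, 1000)
print(f"{fraction:.0%} of traffic can be inspected")
```

Under these assumptions the monitor sees roughly one bit in every hundred, which is why it must be upgraded or redesigned alongside the system it watches.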

Fourth, treat monitoring as an important activity and provide the appropriate engineering resources and processes, such as change advisory boards set up to review, approve and schedule changes.

Monitoring must evolve from a haphazard afterthought to a critical application with specified service levels identifying timeliness, accuracy, uptime and security. For all monitoring, and especially SCADA systems, there must be effective communication between functional groups to ensure the systems are designed, secured and maintained appropriately.


Complex systems require increasingly sophisticated monitoring systems. Care must be taken to design secure systems that meet requirements and are perceived as accurate by the operators.

If a monitoring system is perceived as not adding value, operators will depend on it less and less. This, in turn, creates a fertile environment for security breaches, accidents and all types of inefficiencies. With that in mind, monitoring systems must evolve from afterthoughts into well-engineered systems, and keep evolving, to ensure expectations are met.
