It looks like the average IT staff spends more time cleaning up crises than preventing them. Our Datamation columnist looks at ways to fix that problem.
Where do IT staffers spend their time? Good question.
There are myriad answers to that one question, and most of them point to the fact that a lot of time is spent cleaning up crises instead of preventing them or building the business. On closer evaluation, the way to free up staff time seems to be presenting information and needs to staffers in a rational fashion before a crisis develops.
And that would be the hard part.
After years of ignoring Linux, Microsoft Corp. is taking it head-on. Anyone reading Linux articles lately, for example, may have noticed that they are often accompanied by Microsoft ads touting analyst studies which show that Windows Server has a lower Total Cost of Ownership (TCO) than Linux. Although using Linux would result in lower expenses for hardware, software and downtime, some analysts projected that its higher staffing and training costs would make it 11 percent to 22 percent more expensive than Windows for most uses over a five-year period.
Without getting into the merits of that particular study, what it illustrates is that the cost of supporting a technology can far outweigh its initial cost.
For example, in file serving, IDC calculated that the Linux hardware/software cost would run $4,148 for 100 users, but staffing and training would cost more than 20 times that amount: $88,874. To cut IT expenses, therefore, the first step is to find ways to improve staff efficiency, rather than shaving a few dollars off equipment leases. The second is to cut downtime, which accounted for 10 percent to more than 40 percent of the TCO of the servers in the study.
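The arithmetic behind that argument is easy to check. The short Python sketch below uses only the two IDC figures quoted above; downtime costs, which the study also counted, are deliberately left out, and the 10 percent cut applied to each line item is a hypothetical illustration, not a figure from the study.

```python
# IDC file-serving figures for 100 users on Linux, as quoted in the text.
HARDWARE_SOFTWARE = 4_148   # dollars
STAFFING_TRAINING = 88_874  # dollars


def ratio(staff: float, hw_sw: float) -> float:
    """How many times larger staffing/training is than hardware/software."""
    return staff / hw_sw


def staff_share(staff: float, hw_sw: float) -> float:
    """Staffing/training as a fraction of these two items only (downtime excluded)."""
    return staff / (staff + hw_sw)


def savings(cost: float, cut: float) -> float:
    """Dollars saved by trimming a cost line by the fraction `cut`."""
    return cost * cut


print(f"{ratio(STAFFING_TRAINING, HARDWARE_SOFTWARE):.1f}x")       # prints 21.4x
print(f"{staff_share(STAFFING_TRAINING, HARDWARE_SOFTWARE):.1%}")  # prints 95.5%
# Hypothetical 10 percent cut applied to each line item in turn:
print(f"${savings(STAFFING_TRAINING, 0.10):,.0f} vs "
      f"${savings(HARDWARE_SOFTWARE, 0.10):,.0f}")                 # prints $8,887 vs $415
```

The same percentage gain in staff efficiency is worth roughly 21 times as much as an equal trim to the hardware and software bill.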
Let's take a look at counting IT hours.
Efficiency and downtime, of course, are closely related. Improving efficiency frees up time to spend on preventing downtime. And when staff aren't running around extinguishing downtime fires, they have more time for proactive system management, resulting in greater efficiency.
Consulting firm Infonetics Research, based in San Jose, Calif., conducts regular studies on how downtime affects the enterprise. Its 2000 report examined 85 companies with 1,000 or more employees and found that downtime cost them an average of more than $32 million a year in lost revenue and productivity.
In last year's report, rather than going over the same ground as they had before, analysts looked in depth at six organizations in different industries to provide a more granular view of specific causes of outages and performance degradations.
"All areas of downtime are worth investigating, but based on the results of this research, it is likely that many companies could seriously improve their bottom line by investing in technologies that decrease the number and duration of service degradations, because they have a huge (and often hidden) impact on both revenue and productivity loss," analysts reported in the study.
But this leads back to the problem stated earlier. Even if you have the money needed to invest in such technologies, staff are too busy dealing with the crisis of the moment to install, learn, manage and use those tools.
"It is crucial for network managers to spend more of their management time on planning to reduce the time spent on [the other functions]," the report states. "This will help them to move out of the reactive fire-fighting mode and into a more proactive approach of managing their networks."
Digging Yourself Out
Getting out of the trap is one of those chicken-and-egg situations. If you spent more time proactively monitoring system health, and then planning and implementing strategies to keep it that way, you wouldn't need to spend as much time handling emergencies. But you have too many emergencies needing immediate attention to do all the monitoring and managing you should.
Most companies already have more than enough data to do an adequate job of managing their IT resources.
To begin with, there is the wealth of data stored in logs. When there is a crash, an administrator must go hunting through the logs for the sequence of errors that led up to it. Had the administrator looked earlier, he would have known trouble was looming and could have taken preventive measures, but who has time to continually monitor the dozens, hundreds or thousands of logs a company has?
"Our admins would go in after a crash and look at the logs to see what happened right before a server locked up," says Steve Luciano, network administrator for New Pig Corp., which provides products for liquid management, industrial safety and plant maintenance to more than 170,000 customers in more than 40 countries. "But no one was checking their boxes on a regular basis. It was difficult to do considering how many servers they were responsible for and everything else they had to do."
Zurich Life Insurance in Schaumburg, Ill., faced a similar problem.
"It was clear that the IT organization was in a reactionary mode," says Tim Hagn, Zurich's vice president of IT Operations and Engineering, describing what he found when he arrived on the job. "We were addressing problems after the customer base had been affected or had called with a problem, which is not a successful mode to be in."
In both cases, the answer was to gather the information that already existed throughout the network and present it to the administrators in a single console. For Hagn, that meant installing Hewlett-Packard Co.'s HP OpenView and configuring it to collect the information from lower-level monitoring software.
"Tools like IBM's Tivoli or BMC's Patrol can't monitor Cisco's devices as well as Cisco's tools do," he explains. "It works best to let the vendors' tools monitor their own devices and then dump the information into a central console."
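The pattern Hagn describes (each vendor's tool watches its own gear, and a central console merges the results) can be sketched in a few lines of Python. This is a toy illustration, not how HP OpenView works internally: the tool names, the `(device, severity, message)` tuple format and the "0 is most severe" scale are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping, Tuple


@dataclass(frozen=True)
class ConsoleEvent:
    tool: str      # vendor tool that reported the event (hypothetical names)
    device: str
    severity: int  # assumed scale: 0 = most severe, larger = less severe
    message: str


# Each vendor tool is assumed to export (device, severity, message) tuples.
Feed = Iterable[Tuple[str, int, str]]


def central_console(feeds: Mapping[str, Feed]) -> List[ConsoleEvent]:
    """Merge every vendor feed into one view, most severe events first."""
    merged = [
        ConsoleEvent(tool, device, severity, message)
        for tool, events in feeds.items()
        for device, severity, message in events
    ]
    return sorted(merged, key=lambda e: (e.severity, e.device))


view = central_console({
    "network_tool": [("core-switch-1", 2, "port flapping"),
                     ("edge-router", 5, "config saved")],
    "server_tool": [("fileserver-3", 1, "disk failure predicted")],
})
for event in view:
    print(f"[{event.severity}] {event.tool:12} {event.device:14} {event.message}")
```

The design choice mirrors the quote: the merge step knows nothing about how each device is monitored; it only normalizes and ranks what the specialist tools report.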
For Luciano, this meant buying and installing a log monitoring tool. He selected Logalot from Somix Technologies, Inc., of Sanford, Me. Logalot collects entries from syslog, SNMP traps and Windows event logs and puts them into a combined database. He set it up to gather the data from all his servers, switches and routers.
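Logalot's internals aren't described here, but the general idea of normalizing log entries into one database can be sketched in Python. This minimal sketch handles only classic RFC 3164-style syslog lines, decoding the `<PRI>` prefix (PRI = facility × 8 + severity) and storing the result in an in-memory SQLite table; the table layout is a hypothetical stand-in, and SNMP traps or Windows event logs would need their own parsers feeding the same table.

```python
import re
import sqlite3

# RFC 3164-style line: "<PRI>Mmm dd hh:mm:ss HOST MESSAGE"
SYSLOG_RE = re.compile(r"^<(\d{1,3})>(\w{3} [ \d]\d \d\d:\d\d:\d\d) (\S+) (.*)$")


def parse_syslog(line: str) -> dict:
    """Split a syslog line into normalized fields."""
    m = SYSLOG_RE.match(line)
    if m is None:
        raise ValueError(f"not a syslog line: {line!r}")
    pri = int(m.group(1))
    return {
        "source": "syslog",
        "host": m.group(3),
        "facility": pri // 8,  # PRI = facility * 8 + severity
        "severity": pri % 8,   # 0 = emergency ... 7 = debug
        "message": m.group(4),
    }


def open_store() -> sqlite3.Connection:
    """Hypothetical combined table shared by every log source."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (source TEXT, host TEXT, "
               "severity INTEGER, message TEXT)")
    return db


def store(db: sqlite3.Connection, event: dict) -> None:
    db.execute("INSERT INTO events VALUES (?, ?, ?, ?)",
               (event["source"], event["host"], event["severity"], event["message"]))


db = open_store()
event = parse_syslog("<34>Oct 11 22:14:15 mymachine su: 'su root' failed on /dev/pts/8")
store(db, event)
```

Here `<34>` decodes to facility 4 (security/authorization) and severity 2 (critical).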
At that point he was able to establish policies for what to do with each type of entry. The vast majority are purely informational and simply get archived, but others require immediate action, so the admins get alerted. This allows the staff to address potential problems before they cause a delivery outage. In doing so, they have been able to fix the underlying issues so they don't keep happening.
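A policy table of the "archive most, alert on a few" kind might look like the following Python sketch. The rule format, first-match-wins ordering and archive-by-default behavior are assumptions for illustration, not Logalot's actual policy syntax; severities follow the syslog convention where lower numbers are more severe.

```python
import re
from dataclasses import dataclass
from typing import Sequence

ARCHIVE, ALERT = "archive", "alert"


@dataclass(frozen=True)
class Policy:
    pattern: re.Pattern  # searched for in the message text
    max_severity: int    # applies when severity <= this (syslog: lower = worse)
    action: str          # ALERT or ARCHIVE


def apply_policies(policies: Sequence[Policy], severity: int, message: str) -> str:
    """Return the action of the first matching policy; archive by default."""
    for policy in policies:
        if severity <= policy.max_severity and policy.pattern.search(message):
            return policy.action
    return ARCHIVE


# Hypothetical rules: storage errors (severity 3) or worse, and anything
# critical (severity 2) or worse, page the admins; the rest is archived.
POLICIES = [
    Policy(re.compile(r"disk|raid", re.IGNORECASE), max_severity=3, action=ALERT),
    Policy(re.compile(r".*"), max_severity=2, action=ALERT),
]

for sev, msg in [(6, "user jsmith logged in"), (3, "RAID array degraded")]:
    print(sev, msg, "->", apply_policies(POLICIES, sev, msg))
```

Because the default is to archive, new informational message types never generate noise; an alert fires only when someone has decided a pattern deserves one.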
"Whenever they see the alert, they take the corrective action so the number of alerts has decreased," says Luciano.
And this is the real answer to freeing up staff time: presenting the information to the IT staff in a comprehensible fashion before it becomes a crisis. That can never be done if they have too many places to scan to get a complete picture of what is going on.
"I don't want to look at 17 different consoles," says Hagn. "I want it all tying into one central location. The criteria now for every additional tool or utility is how well it can tie into OpenView."