The vast majority of IT infrastructure problems relate to the network. After all, most of IT infrastructure IS the network. Makes sense.
But its immensity and complexity make the network a bear to troubleshoot and root cause analysis as tricky as finding the proverbial needle in the network haystack.
Making root cause analysis tough is the fact that computing is more distributed than ever.
Key applications run both on-premises and as SaaS in the cloud. Some are even hybrid, so the processing is shared between the two. That means network IT pros must monitor the cloud and on-premises infrastructure to troubleshoot hybrid apps.
Additional complexity comes from the fact that few in-house servers today are not virtualized. Because each physical server is transformed into many virtual ones, it is hard for IT to find which VM is at fault. Today it is not just the servers that need monitoring, but the virtual machines as well.
Meanwhile, the problems IT pros need to hunt down to the root cause are varied. The worst are the deal breakers where the network or an application is completely down. Not as bad, but often tougher to trace, are slow performance problems.
You know how frustrated your auto mechanic gets when you bring in your car for a problem that comes and goes and is gone when the mechanic takes a look. That same issue troubles IT pros. “The biggest headache for IT is dealing with intermittent performance problems. These are those problems that make themselves apparent and disappear before you can identify the source, only to happen again and again, but randomly. In most cases, these intermittent performance problems look like they are rooted in a certain area of your network where in fact they are stemming from a completely different one,” explained the Progress IT Pro’s Guide to Faster Troubleshooting eBook. “More than one-third of the IT pros surveyed were able to fix intermittent performance issues within minutes. However, almost the same number of respondents spent hours finding the source of the problem, others taking days and even months to resolve. The harder the problem is to find, the more downtime can accumulate over the course of a year.”
Network monitoring allows IT to be proactive rather than reactive. “The tool will ideally provide an early warning system when an issue starts to arise that could lead to unhappy users and downtime,” the eBook argued.
The key to root cause analysis is fully knowing your network. If you have a network monitoring solution, it will send an alert when there is trouble. A smart IT pro will go to the network maps to begin the diagnosis. This network visualization cuts troubleshooting time by hours, sometimes even days.
A good network monitoring solution should discover Layer 2 and Layer 3 network information and use this data to automatically generate network topology maps. These are the very maps that help lead IT to the root cause and accelerate response while reducing mean time to resolution (MTTR).
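To make the idea concrete, here is a minimal sketch of how discovered Layer 2 neighbor data (the kind a monitoring tool would poll from LLDP or CDP tables via SNMP) can be turned into a topology map that IT can walk during root cause analysis. The device names and neighbor pairs are illustrative, not from any real network.

```python
# Build an adjacency map from discovered Layer 2 neighbor pairs, then
# trace the path from an access switch back to the core -- the same walk
# a topology map lets IT do visually when chasing a root cause.
from collections import defaultdict

# Illustrative LLDP/CDP neighbor pairs; a real tool discovers these via SNMP.
lldp_neighbors = [
    ("core-sw1", "dist-sw1"),
    ("core-sw1", "dist-sw2"),
    ("dist-sw1", "access-sw1"),
    ("dist-sw2", "access-sw2"),
]

topology = defaultdict(set)
for local, remote in lldp_neighbors:
    topology[local].add(remote)
    topology[remote].add(local)

def path_to_core(device, core="core-sw1"):
    """Breadth-first search from a device back to the core switch."""
    visited, queue = {device}, [[device]]
    while queue:
        path = queue.pop(0)
        if path[-1] == core:
            return path
        for neighbor in sorted(topology[path[-1]]):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(path_to_core("access-sw1"))  # ['access-sw1', 'dist-sw1', 'core-sw1']
```

If a user on access-sw1 reports trouble, the path immediately narrows the search to the devices between that user and the core, instead of the whole network.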
Application problems are tricky to hunt down. It is not just the software itself: applications depend on enabling technologies like web servers, databases and network elements. They are also often dependent on other applications, and when those underlying pieces fail, they take down or disrupt the applications that depend on them.
Your network monitoring solution should allow IT to define and then monitor these dependencies to help track the state of an application. This should also be part of the alerting system. If an application is dependent upon a network element and that element fails, the alert should point IT first to a failed element, not the application.
Let’s say that SharePoint webpages are no longer working. That seems to be the problem, but if these pages depend upon Microsoft Internet Information Services (IIS) and IIS fails, then that is the real culprit.
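The SharePoint/IIS scenario can be sketched in a few lines. This is a hedged illustration of dependency-aware alerting, not any vendor's implementation: given a dependency graph and a set of failed items, surface only the failures whose own dependencies are still healthy, so IT is pointed at the culprit rather than the symptom. The item names are illustrative.

```python
# Map each monitored item to the items it depends on (illustrative names).
dependencies = {
    "SharePoint pages": ["IIS"],
    "IIS": ["web-server-01"],
    "web-server-01": [],
}

def root_causes(failed):
    """Return only the failed items none of whose own dependencies failed."""
    roots = []
    for item in failed:
        if not any(dep in failed for dep in dependencies.get(item, [])):
            roots.append(item)
    return roots

# SharePoint is down because IIS is down: alert on IIS, not SharePoint.
print(root_causes({"SharePoint pages", "IIS"}))  # ['IIS']
```

With this filter in the alerting path, a cascade of dependent failures collapses into one actionable alert on the failed element itself.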
Tracking application states is key to warding off problems or finding the root cause when they occur. “An IT monitoring system can support several application states: the up state, the down state, the warning state and the maintenance state. This allows IT to define an application state by assigning threshold values to monitored performance metrics,” the Progress IT Pro’s Guide to Faster Troubleshooting eBook explained.
Tracking devices that support the application is vital to heading off problems. “For example, when CPU utilization for a process exceeds 75% on a server, this should put the application in the warning state. When CPU utilization exceeds 90%, IT should be alerted that the application is in the down state. This provides IT managers with early warning and ample time to respond to performance problems before impacting users and the business,” the eBook argued.
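The threshold-to-state mapping the eBook describes is simple enough to sketch directly. This minimal example uses the CPU thresholds quoted above (75% for warning, 90% for down) plus the maintenance state; the function name and signature are my own illustration.

```python
# Map a monitored CPU utilization reading to an application state,
# using the 75%/90% thresholds described above.
def application_state(cpu_percent, in_maintenance=False):
    """Return one of the four application states for a CPU reading."""
    if in_maintenance:
        return "maintenance"  # scheduled work: suppress alerting
    if cpu_percent > 90:
        return "down"         # alert IT: users are likely impacted
    if cpu_percent > 75:
        return "warning"      # early warning, time to respond
    return "up"

print(application_state(60))  # up
print(application_state(80))  # warning
print(application_state(95))  # down
```

The warning band between the two thresholds is what buys IT the “ample time to respond” before users feel the problem.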
These days, when end users imagine the network they think of Wi-Fi. Chances are Wi-Fi is everywhere in your organization, and the complexity and volume of devices makes wireless root cause analysis difficult. Don’t fret. Let your network monitoring solution do the heavy lifting.
“When it comes to wireless networks, displaying wireless LAN controllers (WLC), access points and clients is crucial. These maps should get updated with every polling cycle to show new clients as they log onto the wireless network,” the Progress eBook stated. “When a wireless network end user complains about performance, a wireless network map lets you follow the connection between the client, access point and WLC. You can also see all the other clients connected to the same access point, possibly indicating an oversubscription problem. The first question you should ask when a network issue arises is: ‘Do I have an access point capacity problem?’”
Wireless network infrastructure needs to be looked at in real-time, but to get a sense of overall health and trends, historical data should be gathered to show patterns in items such as client count and bandwidth usage. “This allows you to correlate graphs to the time when a performance problem was reported. By analyzing patterns in the number of clients connected to an access point, and the corresponding bandwidth usage, you can determine if the access point can handle wireless volumes at peak usage,” the eBook stated. “Historical graphs covering WLC, CPU, and memory utilization should also be viewed in multiple time measurements to expose patterns that can be correlated to the timing of reported performance problems. High utilization of either of these resources indicates that a WLC can’t keep up with peak usage on the wireless network. If you are satisfied with the wireless capacity your users have – but they aren’t – you may want to ask yourself if you have a signal strength problem.”
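The correlation step the eBook describes, lining up historical client counts with the time a problem was reported, can be sketched as follows. The hourly samples are invented for illustration; a real system would pull them from the monitoring database for the access point in question.

```python
# Illustrative hourly samples for one access point:
# (hour_of_day, client_count, bandwidth_mbps)
samples = [
    (9, 12, 40), (10, 25, 85), (11, 48, 180), (12, 51, 195), (13, 30, 110),
]

def busiest_hours(samples, top=2):
    """Hours with the highest client counts, to compare against complaint times."""
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    return [hour for hour, clients, bandwidth in ranked[:top]]

# A complaint logged at noon lines up with the peak client count, which
# points toward an access point capacity (oversubscription) problem
# rather than a one-off glitch.
print(busiest_hours(samples))  # [12, 11]
```

If complaints instead cluster at off-peak hours with low client counts, capacity is probably not the issue, and signal strength becomes the next question to ask.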
Solving performance and uptime problems on the network is a constant race against the clock. To win, IT needs visibility into all the silos that comprise your IT infrastructure.
Unified network monitoring delivers that visibility. It begins by discovering all the devices, applications and services on the network, then tracing the connectivity and dependencies between them.
Root cause analysis is not well served by siloed monitoring tools, where each device type has its own monitoring solution. How many consoles can an IT pro stare at before they go crazy?
Instead, a monitoring solution with broad and deep visibility into the network and its services is what’s called for. You may still have siloed tools for on-premises networks, core applications, OSes and virtual servers, but relying entirely on only these makes it tough to pinpoint the root cause of complex—and often deeply hidden—network problems.
It is far more efficient to have a single network monitoring solution that sees everything, rather than IT manually sifting through alerts and the data from a gaggle of discrete siloed monitoring tools.