When pilots talk about things getting busy in the cockpit, they mean something very specific: clanging alerts, paralyzing uncertainty and the ground getting closer.
Sound familiar? The alerts you receive from your server monitors aren't quite as dramatic, but the outcome can be the same: a crash. You may walk away from it in one piece, but your company's business might not. Consequently, the right time to handle an emergency is before it becomes an emergency. As is usually the case, this is easier said than done. Here are the best server monitoring practices to ensure a safe landing:
1. Flight Characteristics
As with anything in network monitoring, it's important to establish a "normal." How does your network behave during normal operations? Until you know how things look when they are going right, you won't be able to respond effectively if something goes wrong. Continuously monitoring your network will give you a picture of its normal, baseline operation, and you can then calibrate your alerts so that they indicate significant (not trivial) deviations from your normal activity. The more familiar you are with your activity baseline, the better you'll be at recognizing and responding to deviations.
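One way to picture the baseline idea is the minimal sketch below. It assumes you have a list of readings taken during normal operations (the CPU percentages here are made up) and treats anything more than three standard deviations from the mean as significant. The function names and the three-sigma cutoff are illustrative choices, not a feature of any particular monitoring tool.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)

def is_significant(value, baseline, max_sigmas=3.0):
    """Return True when value sits more than max_sigmas deviations from the mean."""
    avg, spread = baseline
    if spread == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return value != avg
    return abs(value - avg) / spread > max_sigmas

# Hypothetical CPU utilization readings (percent) from a normal week.
normal_cpu = [41, 39, 44, 40, 42, 38, 43, 41]
baseline = build_baseline(normal_cpu)

print(is_significant(45, baseline))  # small wobble, inside the normal band
print(is_significant(90, baseline))  # far outside the normal band
```

Raising or lowering `max_sigmas` is the calibration step: a higher value means fewer, more serious alerts.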
2. Weather Conditions
Know your external environment. The "weather" that a network flies through is the constant ebb and flow of traffic. Spikes leading to a storm of alerts can arise from predictable conditions — say, the Super Bowl — or in response to events that are impossible to predict (such as the release of dramatic news video footage). Knowing how your network will respond to heavy weather is critical. For example, as TechRepublic notes, this is when your "bandwidth hogs" will really go hog wild. When you fly into an alert storm, be ready with a plan.
3. It's Quiet — Too Quiet
A complete lack of alerts is a warning sign in and of itself; networks never run that smoothly. Even well-tuned alerting tools generate a certain amount of "noise" during normal network operations. If your alerting system isn't producing any noise at all, something may be wrong with the alerting process, meaning you could be missing signs of developing trouble. Check your solution to make sure it's up and running properly, and set alert thresholds so that they make room for "allowable" deviations without hiding real problems.
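A simple check for suspicious silence might look like the sketch below. It assumes you can look up the timestamp of the most recent alert from your alerting system; the six-hour limit and the function name are hypothetical and would need tuning to how noisy your own network normally is.

```python
from datetime import datetime, timedelta

def alerting_is_too_quiet(last_alert_at, now, max_quiet=timedelta(hours=6)):
    """Return True when no alert has arrived within the allowed quiet period."""
    return now - last_alert_at > max_quiet

# Hypothetical timestamps for illustration.
last_alert = datetime(2024, 1, 10, 2, 0)
checked_at = datetime(2024, 1, 10, 9, 30)

if alerting_is_too_quiet(last_alert, checked_at):
    print("No alerts for over six hours: verify the alerting system itself.")
```

This check should run somewhere other than the alerting system it watches, so that a failure of one doesn't silence the other.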
4. 'Cockpit Resource Management'
Chalk this up to air jargon for making the best use of crew skills. Alerting systems provide alerts based on severity, but are these alerts going to the right people at the right level? People should receive need-to-know alerts in an effective way (such as email or text message) without being flooded with alerts that can be handled at another level, subjecting them to "alert fatigue." If your setup is not calibrated to do this, it's time to adjust policy so that it does.
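Routing by severity can be as plain as a lookup table, as in this sketch. The severity levels, recipient names, and the fallback rule are all assumptions made for illustration; your own tool will have its own way of expressing the same policy.

```python
# Hypothetical routing policy: who hears about which severity.
ROUTES = {
    "critical": ["on-call engineer", "team lead"],
    "warning": ["on-call engineer"],
    "info": [],  # recorded in the log only, nobody is paged
}

def recipients_for(severity):
    """Return the people to notify for a severity level.

    Unknown severities fall back to the on-call engineer so that a
    mislabeled alert is still seen by someone.
    """
    return ROUTES.get(severity, ["on-call engineer"])

print(recipients_for("critical"))
print(recipients_for("info"))
```

Keeping low-severity alerts away from senior staff is what guards against alert fatigue; the empty list for "info" is that choice made explicit.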
5. Don't Mute Alerts
When using email alerts, it is of utmost importance to make sure they aren't ending up in a spam folder. They won't do you any good there. The same goes for whichever alert media you use. Most alerts may only need a quick glance, but you do need to see them. Setting up alert escalation is a good way to make sure that alerts always reach someone, as they'll be sent on to the next person in the chain of command if they are not acknowledged.
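The escalation idea can be sketched as a walk down the chain of command. This simplified version assumes acknowledgement is known up front (a set of names); a real system would wait for a reply or a timeout at each step. The names and the `notify` callback are hypothetical.

```python
def escalate(chain, acknowledgers, notify):
    """Notify each person in order until one acknowledges the alert.

    Returns the name of whoever acknowledged, or None if nobody did.
    """
    for person in chain:
        notify(person)
        if person in acknowledgers:
            return person
    return None

# Hypothetical chain of command and a stand-in for sending a message.
chain = ["on-call engineer", "team lead", "operations manager"]
notified = []

handled_by = escalate(chain, {"team lead"}, notified.append)
print(notified)    # everyone contacted before the alert was acknowledged
print(handled_by)  # the person who acknowledged it
```

A `None` result means the alert went unanswered all the way up the chain, which is itself worth treating as an emergency.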
6. Insist on Airworthiness
Monitoring and alert solutions are not created equal. If your server monitoring tools aren't up to the job, replace them with better ones as soon as you can. A homegrown tool can save you money, but it rarely saves you time, and it can lack the depth of visibility you need to stay on top of what your network is doing.
7. Don't Panic!
The simplest advice of all, but the hardest to remember in an emergency. Your network monitoring solution can't help you here; only experience and practice can. It's important to have a plan of action. Cool nerves help, too.
8. First of All, Aviate!
Flight recorders recovered from crash sites reveal too many cases wherein aircrews got so wrapped up in specific problems they forgot the big picture — actually flying the plane. Remember, alerts are a tool. Don't let responding to them get in the way of maintaining a smooth flight for your users.
The best server monitoring practices don't depend on specific technology solutions. They are inputs that only you can provide, so that the resulting alerts work for you and help give your network a comfortable flight with minimal turbulence.