
IT Monitoring Clickbait | @CloudExpo #APM #Cloud

Here are a few common alerting problems, along with the reasons they often crop up and how to solve them

It is a sad but very real truth that many, dare I say most, IT professionals consider alerts to be the bane of their existence. After all, alerts are annoying, noisy, mostly useless and frequently false. Those of us who specialize in IT monitoring are therefore well acquainted with the sinking feeling that comes from discovering that an alert we painstakingly crafted is being ignored by the team that receives it.

In that moment of professional heartbreak, you may have considered changing those alerts to make them more eye-catching, more interesting and more urgent. To achieve this, you might have even considered choosing from a menu of possible alert messages. For example:

  • Snarky: Hey server team! Do you even read these alerts anymore?!?
  • Hyperbolic: DANGER, WILL ROBINSON! Router will EXPLODE in 5 minutes!
  • Sympathetic (or just pathetic): Hey, I'm the IIS server and it just got really dark and cold in here. Can someone come turn the lights back on? I'm afraid of the dark.

Or you may have considered going the clickbait route. For example:

  • This server's response time dropped below 75 percent. You won't believe what happened next!
  • We showed these sysadmins the cluster failure at 2:15 a.m. Their reactions were priceless.
  • You swore you would never restart this service. What happened at 2:15 a.m. will change your mind forever.
  • Three naughty long-running queries you never hear about.
  • Hot, Hotter, Hottest! This wireless heat map reveals the Wi-Fi dead zones your access points are trying to hide!
  • Watch what happens when this VM ends up next to a noisy neighbor. The results will shock you!

While all of the above approaches are interesting, to say the least, they miss the larger point: it is deceptively difficult to craft an alert that is meaningful, informative and actionable. To help ensure teams pore over your alerts in the future without tantrums, gimmicks or bribery, here are a few common alerting problems, along with the reasons they often crop up and how to solve them.

Problem: Multiple alerts (and tickets) for the same issue, every few minutes

Reason
This issue is called "sawtoothing." It describes a situation in which a particular incident or condition happens, then resolves, then happens again, and so on, and your monitoring system creates a new alert each time.

Solution
To solve this, first understand that some sawtoothing is an indication of a real problem that needs to be fixed, such as a device that is repeatedly rebooting. But usually sawtoothing happens because a device is "riding the edge" of a trigger threshold; for example, a CPU alert is set to trigger at 90 percent, and a device is hovering between 88 and 92 percent.

There are a few common approaches to solving the issue:

  • Set a time-based delay in the alert trigger so that the device has to be over a certain percentage CPU for more than a pre-set number of minutes. Now, the alert will only find devices that are consistently and continuously over the limit.
  • Use the reset option built into any good monitoring solution and set it lower than the trigger value. For example, set the trigger when CPU is over 90 percent for 10 minutes, but only reset the alert when it's under 80 percent for 20 minutes. This reset option establishes a certain standard of stability within the environment.
  • Use the ticket system API to create two-way communication between the monitoring solution and the ticket system that ensures a new ticket cannot be opened if there is already an existing ticket for a specific problem on that device.
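The first two approaches can be combined in a small state machine. What follows is only a rough sketch, not how any particular monitoring product implements it: the class name HysteresisAlert, its field names and the one-sample-per-minute feed are all assumptions made for illustration, and the default numbers simply mirror the example above (trigger over 90 percent for 10 minutes, reset under 80 percent for 20 minutes).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HysteresisAlert:
    """Hypothetical alert: fire after sustained high CPU, clear after sustained low CPU."""
    trigger_pct: float = 90.0      # fire when CPU stays above this...
    trigger_minutes: int = 10      # ...for at least this many minutes
    reset_pct: float = 80.0        # clear when CPU stays below this...
    reset_minutes: int = 20        # ...for at least this many minutes
    active: bool = False           # True while the alert is firing
    _since: Optional[int] = None   # minute at which the current streak began

    def observe(self, minute: int, cpu_pct: float) -> Optional[str]:
        """Feed one sample; return "TRIGGER", "RESET", or None if nothing changed."""
        if not self.active:
            in_streak = cpu_pct > self.trigger_pct
            needed, event = self.trigger_minutes, "TRIGGER"
        else:
            in_streak = cpu_pct < self.reset_pct
            needed, event = self.reset_minutes, "RESET"

        if not in_streak:
            # The streak is broken, so the delay timer starts over.
            self._since = None
            return None
        if self._since is None:
            self._since = minute
        if minute - self._since >= needed:
            self.active = not self.active
            self._since = None
            return event
        return None


if __name__ == "__main__":
    alert = HysteresisAlert()
    # A device riding the edge (88, 92, 88, 92, ...) never holds a streak.
    for minute in range(30):
        event = alert.observe(minute, 88 if minute % 2 == 0 else 92)
        if event:
            print(minute, event)
```

Because the reset threshold sits below the trigger threshold, a device that drifts back to 85 percent after triggering stays in the alerting state instead of clearing and re-firing a few minutes later.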

Problem: A key device goes down - for example, the edge router at a remote site - and the team gets clobbered with alerts for every other device at the site

Reason
When a monitoring system loses visibility of a device, it often reports that device as "down." However, that doesn't necessarily mean the device itself has failed; a device upstream could be down, in which case nothing behind it can be monitored until the upstream device comes back up.

Solution
Any worthwhile monitoring solution will have an option to suppress alerts based on "upstream" or "parent-child" connections. Make sure this option is enabled and that the monitoring solution understands the device dependencies in your environment.
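To make the idea concrete, here is a minimal sketch of parent-child suppression. It assumes the dependency map is a plain dictionary from each device to its upstream parent; the function name alerts_to_send and the device names are hypothetical, and a real monitoring solution will store and evaluate dependencies in its own way.

```python
from typing import Dict, Optional, Set


def alerts_to_send(parents: Dict[str, Optional[str]], unreachable: Set[str]) -> Set[str]:
    """Return the unreachable devices that have no unreachable ancestor upstream."""
    root_causes = set()
    for device in unreachable:
        seen = {device}                      # guards against a cyclic dependency map
        ancestor = parents.get(device)
        suppressed = False
        while ancestor is not None and ancestor not in seen:
            if ancestor in unreachable:
                # Something upstream is already unreachable, so this device
                # is probably just hidden behind it: suppress its alert.
                suppressed = True
                break
            seen.add(ancestor)
            ancestor = parents.get(ancestor)
        if not suppressed:
            root_causes.add(device)
    return root_causes


if __name__ == "__main__":
    # Hypothetical remote site: everything hangs off the edge router.
    site = {
        "edge-router": None,
        "switch-1": "edge-router",
        "server-a": "switch-1",
        "server-b": "switch-1",
    }
    print(alerts_to_send(site, {"edge-router", "switch-1", "server-a", "server-b"}))
```

With the whole site unreachable, only the edge router produces an alert; if just one server is unreachable while its switch and router still respond, that server is reported on its own.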

Problem: You have to set up multitudes of alerts because each machine is slightly different

Reason
You may find yourself having to set up the same general alert (CPU utilization, disk full, application down, etc.) for an ungodly number of devices because each machine requires a slightly different threshold, timing, recipient or other element.

Solution
We monitoring engineers find ourselves in this situation when we (or the tool we're using) don't leverage custom fields. In other words, any sophisticated monitoring solution should allow for custom properties for things like "CPU_Critical_Value." This is set on a per-device basis, so that an alert goes from looking like this, "Alert when CPU % Utilization is >= 90%," to this, "Alert when CPU % Utilization is >= CPU_Critical_Value."

This solution allows each system to have its own customized threshold, but a single alert can handle it all. The same technique can be used for alert recipients. Instead of having separate but identical CPU alerts for the server, network and storage teams, each device can have a custom field called "Owner_Group_Email" that holds an email group name. Then you create a single alert that is sent to whatever address is in that field.
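As a rough illustration, the single-alert approach might look like the sketch below. Only the field names CPU_Critical_Value and Owner_Group_Email come from the text above; the Device class, the DEFAULTS fallback values, the evaluate_cpu_alert function and the example addresses are assumptions invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical fallbacks for devices that do not set their own custom fields.
DEFAULTS = {
    "CPU_Critical_Value": 90.0,
    "Owner_Group_Email": "monitoring-team@example.com",
}


@dataclass
class Device:
    name: str
    custom: Dict[str, object] = field(default_factory=dict)

    def prop(self, key: str):
        """Look up a custom field, falling back to the default."""
        return self.custom.get(key, DEFAULTS[key])


def evaluate_cpu_alert(devices: List[Device], cpu_readings: Dict[str, float]) -> List[Tuple[str, str]]:
    """One alert definition for every device: compare against each device's own threshold."""
    notifications = []
    for device in devices:
        cpu = cpu_readings.get(device.name)
        if cpu is not None and cpu >= device.prop("CPU_Critical_Value"):
            message = f"{device.name}: CPU at {cpu:.0f}%"
            notifications.append((device.prop("Owner_Group_Email"), message))
    return notifications


if __name__ == "__main__":
    devices = [
        Device("web01", {"CPU_Critical_Value": 75.0, "Owner_Group_Email": "server-team@example.com"}),
        Device("db01", {"CPU_Critical_Value": 95.0, "Owner_Group_Email": "storage-team@example.com"}),
    ]
    for recipient, message in evaluate_cpu_alert(devices, {"web01": 80.0, "db01": 90.0}):
        print(recipient, message)
```

Here web01 alerts at 80 percent because its own threshold is 75, while db01 stays quiet at 90 percent because its threshold is 95, and the one notification goes to the group named on the device.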

Problem: Certain devices trigger at certain times because the work they're doing causes them to "run hot"

Reason
During the normal course of business, some systems have periods of high utilization that are completely normal, but also completely above the regular run rate. This could be due to month-end report processing; code compile sequences overnight or on the weekend; or any other cyclical, predictable operation.

The problem here is that the normal threshold for the condition in question is fine most of the time, but the "high usage" value is above it, so an alert triggers. If you instead raise the threshold for that system to the "high usage" level, you will miss important issues that occur during normal periods and never reach that higher threshold.

Solution
Rather than triggering a threshold on a set value - even if it is set per device as described earlier - you can use the monitoring data to your advantage. Remember, monitoring is not an alert or page, nor is it a blinky dot on a screen. Monitoring is nothing more (or less) than the regular, steady, ongoing collection of a consistent set of metrics from a set of devices. All the rest - alerts, emails, blinky dots and more - is the happy byproduct you enjoy when you do monitoring correctly.

If you've been collecting all that data, why not analyze it to see what "normal" looks like for each device? This is called a "baseline" and it reflects not just an overall average, but also the normal run rate per day and even per hour. If you can derive this "baseline" value, then your alert trigger can go from, "Alert when CPU % utilization is >= <some fixed value>," to, "Alert when CPU % utilization is >= 10% over the baseline for this time period."
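A baseline can be as simple as an average per hour of the day. The sketch below assumes the history is already available as (hour of day, CPU percent) pairs and reads "10% over the baseline" as a relative margin (baseline times 1.10) rather than ten percentage points; both of those choices, along with the function names, are assumptions for illustration. Commercial tools typically use more robust statistics than a plain mean.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def hourly_baseline(history: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average the collected CPU samples separately for each hour of the day."""
    buckets = defaultdict(list)
    for hour, cpu_pct in history:
        buckets[hour].append(cpu_pct)
    return {hour: mean(values) for hour, values in buckets.items()}


def exceeds_baseline(baseline: Dict[int, float], hour: int, cpu_pct: float, margin: float = 0.10) -> bool:
    """True when the reading is at least `margin` (relative) over that hour's baseline."""
    expected = baseline.get(hour)
    if expected is None:
        return False  # no history for this hour yet, so there is nothing to compare against
    return cpu_pct >= expected * (1 + margin)


if __name__ == "__main__":
    # Hypothetical history: a nightly compile job at 02:00, quiet afternoons at 14:00.
    history = [(2, 80.0), (2, 84.0), (2, 82.0), (14, 30.0), (14, 34.0), (14, 32.0)]
    baseline = hourly_baseline(history)
    print(exceeds_baseline(baseline, 2, 85.0))   # busy, but normal for 02:00
    print(exceeds_baseline(baseline, 14, 85.0))  # far above normal for 14:00
```

The same 85 percent reading is ignored during the nightly job but flagged in the afternoon, which is exactly the distinction a fixed threshold cannot make.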

IT pros tried these weird monitoring tricks and the results will shock you!
When monitoring engineers implement and use the capabilities of their monitoring solutions to the fullest, the results are liberating for all parties. Alerts become both more specific and less frequent, which gives teams more time to actually get work done. This in turn causes those same teams to trust the alerts more and react to them in a timely fashion, which benefits the entire business. Best of all, everyone experiences the true value that good monitoring brings and starts engaging us monitoring engineers to create alerts and build insight that helps stabilize and improve the environment even more.

"Monitoring Team Saved This Company $$$" isn't some fake headline designed to get clicks. With a little work, it can be the truth for every organization.

For even more alerting insights, check out the latest episode of SolarWinds Lab here - what happens at 22:17 will blow you away!

More Stories By Leon Adato

Leon Adato is a Head Geek and technical evangelist at SolarWinds and is a Cisco® Certified Network Associate (CCNA), MCSE and SolarWinds Certified Professional (he was once a customer, after all). His 25 years of network management experience spans financial, healthcare, food and beverage, and other industries.
