SYS-CON MEDIA Authors: Yeshim Deniz, Liz McMillan, Elizabeth White, Maria C. Horton, Andy Thurai


IT Monitoring Clickbait | @CloudExpo #APM #Cloud

Here are a few common alerting problems, along with the reasons they often crop up and how to solve them

It is a sad but very real truth that many, dare I say most, IT professionals consider alerts to be the bane of their existence. After all, they're annoying, noisy, mostly useless and frequently false. Thus, we IT professionals who specialize in IT monitoring are likely well acquainted with that familiar sinking feeling brought on by the discovery that the alert you so painstakingly crafted is being ignored by the team who receives it.

In that moment of professional heartbreak, you may have considered changing those alerts to make them more eye-catching, more interesting and more urgent. To achieve this, you might have even considered choosing from a menu of possible alert messages. For example:

  • Snarky: Hey server team! Do you even read these alerts anymore?!?
  • Hyperbolic: DANGER, WILL ROBINSON! Router will EXPLODE in 5 minutes!
  • Sympathetic (or just pathetic): Hey, I'm the IIS server and it just got really dark and cold in here. Can someone come turn the lights back on? I'm afraid of the dark.

Or you may have considered going the clickbait route. For example:

  • This server's response time dropped below 75 percent. You won't believe what happened next!
  • We showed these sysadmins the cluster failure at 2:15 a.m. Their reactions were priceless.
  • You swore you would never restart this service. What happened at 2:15 a.m. will change your mind forever.
  • Three naughty long-running queries you never hear about.
  • Hot, Hotter, Hottest! This wireless heat map reveals the Wi-Fi dead zones your access points are trying to hide!
  • Watch what happens when this VM ends up next to a noisy neighbor. The results will shock you!

While all of the above approaches are interesting to say the least, they, of course, miss the larger point: it is deceptively difficult to craft an alert that is meaningful, informative and actionable. To combat this issue and ensure teams are poring over your alerts in the future without needing tantrums, gimmicks or bribery, here are a few common alerting problems, along with the reasons they often crop up and how to solve them.

Problem: Multiple alerts (and tickets) for the same issue, every few minutes

Reason
This issue is called "sawtoothing." It describes a situation in which a particular incident or condition happens, then resolves, then happens again, over and over, and your monitoring system creates a new alert each time.

Solution
To solve this, first understand that some sawtoothing is an indication of a real problem that needs to be fixed, such as a device that is repeatedly rebooting. More often, though, it happens because a device is "riding the edge" of a trigger threshold: for example, a CPU alert is set to trigger at 90 percent and the device is hovering between 88 and 92 percent.

There are a few common approaches to solving the issue:

  • Set a time-based delay in the alert trigger so that the device has to be over a certain percentage CPU for more than a pre-set number of minutes. Now, the alert will only find devices that are consistently and continuously over the limit.
  • Use the reset option built into any good monitoring solution and set it lower than the trigger value. For example, set the trigger when CPU is over 90 percent for 10 minutes, but only reset the alert when it's under 80 percent for 20 minutes. Because the reset bar is both lower and longer than the trigger, the alert only clears once the device has genuinely settled down.
  • Use the ticket system API to create two-way communication between the monitoring solution and the ticket system that ensures a new ticket cannot be opened if there is already an existing ticket for a specific problem on that device.
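To make the first two approaches concrete, here is a minimal sketch of a trigger delay combined with a lower reset value. It is not how any particular monitoring product implements this; the class name `HysteresisAlert`, its fields, and the assumption of one CPU sample per minute are all invented for illustration. The numbers match the example above: trigger over 90 percent for 10 minutes, reset under 80 percent for 20 minutes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HysteresisAlert:
    """Hypothetical alert with a trigger delay and a lower reset value."""
    trigger_pct: float = 90.0   # must stay ABOVE this...
    trigger_minutes: int = 10   # ...for this many consecutive samples
    reset_pct: float = 80.0     # must stay BELOW this...
    reset_minutes: int = 20     # ...for this many consecutive samples
    active: bool = False
    streak: int = 0             # consecutive samples meeting the pending condition

    def observe(self, cpu_pct: float) -> Optional[str]:
        """Feed one sample per minute (an assumption); return an event or None."""
        if self.active:
            condition_met = cpu_pct < self.reset_pct
            needed = self.reset_minutes
        else:
            condition_met = cpu_pct > self.trigger_pct
            needed = self.trigger_minutes

        # Any sample that breaks the run starts the count over.
        self.streak = self.streak + 1 if condition_met else 0
        if self.streak < needed:
            return None

        self.active = not self.active
        self.streak = 0
        return "TRIGGER" if self.active else "RESET"


# A device "riding the edge" between 88 and 92 percent never fires.
alert = HysteresisAlert()
events = [alert.observe(cpu) for cpu in [92, 88] * 15]
print(any(events))  # False
```

With a fixed 90 percent threshold and no delay, that same half hour of samples would have produced 15 separate alerts. The third approach, de-duplicating against the ticket system, depends entirely on the API of whichever ticketing tool you use, so it is not sketched here.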

Problem: A key device goes down - for example, the edge router at a remote site - and the team gets clobbered with alerts for every other device at the site

Reason
If a monitoring system loses visibility of a particular device, it will often mark that device as "down." However, that doesn't necessarily mean the device itself has failed; a device upstream could be down, in which case nothing behind it can be monitored until the upstream device comes back up.

Solution
Any worthwhile monitoring solution will have an option to suppress alerts based on "upstream" or "parent-child" connections. Make sure this option is enabled and that the monitoring solution understands the device dependencies in your environment.

Problem: You have to set up multitudes of alerts because each machine is slightly different

Reason
You may find yourself having to set up the same general alert (CPU utilization, disk full, application down, etc.) for an ungodly number of devices because each machine requires a slightly different threshold, timing, recipient or other element.

Solution
We monitoring engineers find ourselves in this situation when we (or the tool we're using) don't leverage custom fields. Any sophisticated monitoring solution should allow for custom properties such as "CPU_Critical_Value," set on a per-device basis, so that an alert goes from looking like this, "Alert when CPU % Utilization is >= 90%," to this, "Alert when CPU % Utilization is >= CPU_Critical_Value."

This solution allows each system to have its own customized threshold, but a single alert can handle it all. The same technique can be used for alert recipients. Instead of maintaining separate but identical CPU alerts for the server, network and storage teams, each device can have a custom field called "Owner_Group_Email" that holds an email group name. Then you create a single alert that is sent to whatever address is in that field.
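Here is a rough sketch of one alert definition driven by per-device custom fields. The field names "CPU_Critical_Value" and "Owner_Group_Email" come from the text above; everything else (the `Device` class, the `cpu_alert` function, the fallback defaults, and the example.com addresses) is assumed for illustration and does not reflect any specific product.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumed fallbacks for devices that have no custom value set.
DEFAULT_CPU_CRITICAL = 90.0
DEFAULT_OWNER_EMAIL = "monitoring-team@example.com"


@dataclass
class Device:
    name: str
    custom: Dict[str, object] = field(default_factory=dict)


def cpu_alert(device: Device, cpu_pct: float) -> Optional[dict]:
    """One alert rule for every device; the device supplies its own values."""
    threshold = float(device.custom.get("CPU_Critical_Value", DEFAULT_CPU_CRITICAL))
    if cpu_pct < threshold:
        return None
    return {
        "to": device.custom.get("Owner_Group_Email", DEFAULT_OWNER_EMAIL),
        "message": f"{device.name}: CPU {cpu_pct:.0f}% >= {threshold:.0f}%",
    }


# Hypothetical devices: a database server that normally runs hot, and a router.
db = Device("db-01", {"CPU_Critical_Value": 95, "Owner_Group_Email": "server-team@example.com"})
router = Device("rtr-01", {"CPU_Critical_Value": 70, "Owner_Group_Email": "network-team@example.com"})

print(cpu_alert(db, 92))      # None: under this device's own threshold
print(cpu_alert(router, 92))  # routed to network-team@example.com
```

The same 92 percent reading is quiet on one device and an alert on the other, and adding a new device means filling in two fields rather than cloning an alert.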

Problem: Certain devices trigger at certain times because the work they're doing causes them to "run hot"

Reason
During the normal course of business, some systems have periods of high utilization that are completely normal, but also completely above the regular run rate. This could be due to month-end report processing; code compile sequences overnight or on the weekend; or any other cyclical, predictable operation.

The problem here is that the normal threshold for the condition in question is fine most of the time, but the "high usage" value sits above it, so an alert triggers during every busy period. If you instead raise the threshold for that system to the "high usage" level, you will miss important issues that occur during normal periods, when a real problem may never climb that high.

Solution
Rather than triggering a threshold on a set value - even if it is set per device as described earlier - you can use the monitoring data to your advantage. Remember, monitoring is not an alert or page, nor is it a blinky dot on a screen. Monitoring is nothing more (or less) than the regular, steady, ongoing collection of a consistent set of metrics from a set of devices. All the rest - alerts, emails, blinky dots and more - is the happy byproduct you enjoy when you do monitoring correctly.

If you've been collecting all that data, why not analyze it to see what "normal" looks like for each device? This is called a "baseline" and it reflects not just an overall average, but also the normal run rate per day and even per hour. If you can derive this "baseline" value, then your alert trigger can go from, "Alert when CPU % utilization is >= <some fixed value>," to, "Alert when CPU % utilization is >= 10% over the baseline for this time period."
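As a minimal sketch of the idea, the snippet below builds a per-day, per-hour baseline from collected samples and compares a new reading against it. Several things here are assumptions: the function names `build_baseline` and `over_baseline` are invented, the baseline is a plain average per (weekday, hour) slot, "10% over" is read as a relative margin (baseline times 1.10) rather than ten percentage points, and the sample data is made up. Real products typically use more robust statistics than a simple mean.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

Slot = Tuple[int, int]  # (weekday 0-6, hour 0-23)


def build_baseline(history: Iterable[Tuple[int, int, float]]) -> Dict[Slot, float]:
    """Average the collected CPU samples for each (weekday, hour) slot."""
    buckets = defaultdict(list)
    for weekday, hour, cpu_pct in history:
        buckets[(weekday, hour)].append(cpu_pct)
    return {slot: mean(values) for slot, values in buckets.items()}


def over_baseline(baseline: Dict[Slot, float], weekday: int, hour: int,
                  cpu_pct: float, margin: float = 0.10) -> bool:
    """True when the reading is at least `margin` above normal for this slot."""
    expected = baseline.get((weekday, hour))
    if expected is None:
        return False  # assumption: stay quiet when there is no history yet
    return cpu_pct >= expected * (1 + margin)


# Made-up history: weekend batch jobs run hot at 2 a.m. Saturday (weekday 5),
# while Tuesday afternoon (weekday 1) is normally quiet.
history = [(5, 2, 80.0), (5, 2, 90.0), (1, 14, 20.0), (1, 14, 30.0)]
baseline = build_baseline(history)

print(over_baseline(baseline, 5, 2, 88.0))   # False: normal for the batch window
print(over_baseline(baseline, 1, 14, 40.0))  # True: far above a quiet Tuesday
```

Notice that 88 percent during the batch window is ignored while 40 percent on a quiet afternoon raises a flag, which is exactly the behavior a single fixed threshold cannot give you.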

IT pros tried these weird monitoring tricks and the results will shock you!
When monitoring engineers implement and use the capabilities of their monitoring solutions to the fullest, the results are liberating for all parties. Alerts become both more specific and less frequent, which gives teams more time to actually get work done. This in turn causes those same teams to trust the alerts more and react to them in a timely fashion, which benefits the entire business. Best of all, everyone experiences the true value that good monitoring brings and starts engaging us monitoring engineers to create alerts and build insight that helps stabilize and improve the environment even more.

"Monitoring Team Saved This Company $$$" isn't some fake headline designed to get clicks. With a little work, it can be the truth for every organization.

For even more alerting insights, check out the latest episode of SolarWinds Lab here - what happens at 22:17 will blow you away!

More Stories By Leon Adato

Leon Adato is a Head Geek and technical evangelist at SolarWinds and is a Cisco® Certified Network Associate (CCNA), MCSE and SolarWinds Certified Professional (he was once a customer, after all). His 25 years of network management experience spans financial, healthcare, food and beverage, and other industries.
