
The Devil’s in the Details: Understanding Event Correlation

Correlation and root-cause diagnosis have always been the holy grail of IT performance monitoring. Instead of managing a flood of alerts, correlation and root-cause diagnosis help IT administrators determine where the real cause of a problem lies and resolve it quickly, minimizing user impact and business loss.

However, all this is not as simple as it sounds! There has always been confusion around event correlation. Terms like event storms, false alerts, root-cause, correlation, analytics, and others are used by vendors with reckless abandon. The result for customers can be a lot of confusion.

As a result, I’ve always liked some of the best-practice guidance around monitoring and event management. Event management’s objectives – detect events, make sense of them and determine the appropriate control action – provide a useful framework, and breaking things down along these lines can make a complex subject much easier to understand.

Data Collection and Detecting Events

This is exactly what it sounds like. Monitors need to collect all sorts of data: from devices, applications, services, users and more. This data can be collected in many ways, by many tools and/or components.

So, the first questions you’ll need to ask include “what are we trying to achieve?” and “what data do we need to collect?” Most customers ask these questions based on their individual perspective – a device, a technology silo, a supporting IT service, a customer-facing IT service or perhaps even a single user.

This perspective can result in a desire for a LOT of monitoring data, where the customer is attempting to “cover all bases” by collecting all the data they can get their hands on – “just in case.” Freeware or log monitoring can be simple, cost-effective ways of collecting large amounts of raw data.

But beware the perception that data collection is where most of your monitoring costs are. In truth, more data is not necessarily a good thing, and data collection is not where the real costs to your organization lie in any case. To determine the real causes of performance problems, you’ll have to balance your desire for fast and inexpensive data collection with your need to make sense of events.

Making Sense of Events with Event Correlation and Root-Cause Analysis

This is where processing and analyzing the data that you’ve collected occurs, and where you ask questions like “how frequently should we collect data?” “what format does the data need to be in?” and “what analysis do we need to perform?”

For example, do you just require information about what’s happening right now, or will you require history (e.g. to identify trends, etc.)? How granular does the collected data need to be, meaning, what are you going to do with it (e.g. identify a remediation action, etc.)? Putting some thought into your monitoring objectives is an important element of determining a data collection strategy.
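To make these decisions concrete, here is a minimal, hypothetical sketch of a collection policy expressed in Python; the metric names, intervals and retention periods are invented for illustration and are not drawn from any particular product:

```python
# Hypothetical data collection strategy captured as configuration.
# Metric names, intervals and retention values are illustrative only.

COLLECTION_POLICY = {
    "cpu_usage": {
        "interval_seconds": 60,        # how frequently to sample
        "retention_days": 90,          # keep history for trend analysis
        "granularity": "per-process",  # detailed enough to drive remediation
    },
    "transaction_response_time": {
        "interval_seconds": 30,
        "retention_days": 365,         # longer history for capacity trending
        "granularity": "per-transaction",
    },
}

def collection_settings(metric: str) -> dict:
    """A collector might consult the policy before sampling a metric."""
    return COLLECTION_POLICY.get(metric, {"interval_seconds": 300,
                                          "retention_days": 30,
                                          "granularity": "aggregate"})
```

Writing the strategy down in this form forces the “what are we going to do with it?” question to be answered per metric, rather than defaulting to collecting everything at maximum detail.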

Most monitoring products today provide some level of formatting and reporting of monitoring data in the form of charts and graphs. This is where terms such as “root-cause analysis” and “correlation” are often used. The key question here is who is doing the analysis, and how. Relying on highly skilled experts to interpret and analyze monitoring data is where the real costs of monitoring come from, and this is where the confusion really begins – event correlation, or making sense of events.

Approaches to Event Correlation

Most customers assume that when they hear “root-cause analysis” or “correlation” there is some level of automation occurring, but relying on your IT staff to interpret log files or graphs is, clearly, manual analysis or correlation. Manual correlation is time-consuming, labor-intensive, requires expertise, and is not scalable as your infrastructure grows. Herein lies the need for monitoring tools that automate this process.

There are several common approaches to correlation:

Rule-Based Correlation
A common and traditional approach to event correlation is rule-based, circuit-based or network-based. These forms of correlation define how events themselves should be analyzed, and a rule-base is built for each combination of events. The early days of network management made heavy use of such solutions. As IT infrastructures have evolved, however, the volume of data collected and the effort required to build rules for every possible event combination make this approach very cumbersome. The challenge is that you must maintain the rule-base, and with the dynamic nature of today’s environments this is becoming increasingly difficult.
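The sketch below is hypothetical Python, intended only to show why the rule-base grows with every new event combination; the event types and rule structure are invented for illustration:

```python
# Minimal sketch of rule-based correlation (illustrative only; event names,
# rule structure and the root-cause mapping are hypothetical).

from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    source: str   # e.g. "router-1", "app-server-3"
    type: str     # e.g. "LINK_DOWN", "APP_TIMEOUT"

# Each rule maps a combination of event types to a presumed root cause.
# Every new combination needs a new rule -- this is the maintenance burden
# described above.
RULES = [
    {"requires": {"LINK_DOWN", "APP_TIMEOUT"}, "root_cause": "LINK_DOWN"},
    {"requires": {"DISK_FULL", "DB_SLOW"},     "root_cause": "DISK_FULL"},
]

def correlate(events: List[Event]) -> List[str]:
    """Return the root-cause event types implied by the current event set."""
    seen = {e.type for e in events}
    return [r["root_cause"] for r in RULES if r["requires"] <= seen]

if __name__ == "__main__":
    storm = [Event("router-1", "LINK_DOWN"), Event("app-server-3", "APP_TIMEOUT")]
    print(correlate(storm))  # ['LINK_DOWN']
```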

History-Based Correlation
Another approach is to learn from past events. Terms like “machine learning” and “analytics” have been used to describe it. The common thread is learning behavior from historical events: if a known pattern recurs, you can quickly isolate where the issue was the last time it was observed.

These approaches are independent of the technology domain, so no domain knowledge is needed. This may limit the level of detail that can be obtained, and if a pattern has not been learned from experience, no correlation will take place. The drawback, of course, is that when problems occur in the software layers, many of the event patterns are new. Furthermore, the dynamic nature of today’s environments makes it less likely that these problem patterns will recur.
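Here is a minimal, hypothetical sketch of history-based matching in Python; the incident “fingerprints”, similarity measure and threshold are invented for illustration:

```python
# Minimal sketch of history-based correlation (illustrative; the incident
# fingerprints and similarity threshold are hypothetical).

HISTORY = [
    # (past event-type fingerprint, root cause diagnosed at the time)
    ({"GC_PAUSE", "HIGH_HEAP", "SLOW_RESPONSE"}, "JVM memory pressure"),
    ({"PACKET_LOSS", "SLOW_RESPONSE"},           "Network degradation"),
]

def jaccard(a: set, b: set) -> float:
    """Similarity between two event-type sets (0..1)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def diagnose(current: set, threshold: float = 0.6):
    """Return the best-matching past diagnosis, or None if the pattern is new."""
    fingerprint, cause = max(HISTORY, key=lambda h: jaccard(current, h[0]))
    return cause if jaccard(current, fingerprint) >= threshold else None

if __name__ == "__main__":
    print(diagnose({"GC_PAUSE", "HIGH_HEAP", "SLOW_RESPONSE"}))  # JVM memory pressure
    print(diagnose({"CONFIG_DRIFT", "SLOW_RESPONSE"}))           # None -- unseen pattern
```

The second call returns nothing precisely because the pattern was never seen before, which is the weakness described above.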

Domain-Based Correlation
This approach is often described with terms like “embedded correlation.” It does not use rules per se; instead, it organizes the measurement data using layered, topology-based dependencies in a time-based model. The monitored data can then be analyzed based on the dependencies and timing of events, so the accuracy of the correlation improves as new data is obtained.

The advantage of this approach is that users can get very specific, granular, actionable information from the correlated data without having to maintain rule-bases or rely on history. And since virtual and cloud infrastructures are dynamic, the dependencies (e.g., between VMs and physical machines) change constantly, so the auto-correlating tool must be able to detect these dynamic dependencies and use them for actionable root-cause diagnosis.
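As a rough illustration, here is a hypothetical Python sketch that filters alerts against a dependency graph; the component names and graph are invented, and a real tool would discover and update the topology automatically:

```python
# Minimal sketch of dependency (topology) based correlation (illustrative;
# the dependency graph and alerting components are hypothetical).

# Directed dependency graph: component -> components it depends on.
# In a real tool this graph would be discovered and kept current as VMs
# move, containers restart, and services are redeployed.
DEPENDS_ON = {
    "web-app": ["app-vm"],
    "app-vm":  ["host-1", "shared-storage"],
    "db":      ["host-1"],
}

def root_causes(alerting: set) -> set:
    """Keep only alerting components whose own dependencies are healthy;
    those are the likely root causes, everything else is a symptom."""
    return {c for c in alerting
            if not any(dep in alerting for dep in DEPENDS_ON.get(c, []))}

if __name__ == "__main__":
    # web-app and app-vm alert only because host-1 is in trouble.
    print(root_causes({"web-app", "app-vm", "host-1"}))  # {'host-1'}
```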

The Devil’s Often in the Details
How far should root-cause diagnosis go? This depends on the individual seeking to use the monitoring tool. For some, knowing that the cause of slowness is the high CPU usage of a Java application may be sufficient; they can simply pass the problem on to a Java expert to investigate. On the other hand, the Java expert may want to know which thread and which line of code within the application is causing the issue. This level of diagnosis is desired in real-time, but often, the experts may not be at hand when a problem surfaces. Therefore, having the ability for the monitoring tool to go back in time and present the same level of detail for root-cause diagnosis is equally important.

The level of detail can be the difference between an actionable event and one that requires a skilled IT person to further investigate.

Determine the Appropriate Control Action (Automated IT Operations)

This is where some organizations are focusing, sometimes with limited-to-no evaluation of the monitoring environment. Many IT operations tasks can be automated with limited concern for monitoring, such as automating the provisioning process or request fulfillment.

But if your goal is to automate remediation actions when issues arise, this will involve monitoring. The event management process tends to trigger many other processes, such as Incident, Problem, Change and Capacity Management. But before you automate remediation tasks, you’ll need a high degree of certainty that you’ve correctly identified the root-cause.

The level of diagnostic detail is relevant here, since without specific detail you may only be able to automate very simple remediation actions (restarting a server, etc.).

As you begin to populate operational remediation policies (also sometimes called rules), you will need to ensure that you can effectively maintain them. Rule-based correlation approaches can therefore come with risk: failure to maintain the correlation rules can render the remediation policies obsolete. Solutions that correlate with a high degree of accuracy, and that eliminate or simplify correlation maintenance, can be an advantage here.
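To illustrate the gating idea, here is a hypothetical Python sketch in which remediation is only automated when the correlation engine reports high confidence in the diagnosis; the fields, thresholds and actions are invented for illustration:

```python
# Minimal sketch of confidence-gated automated remediation (illustrative;
# the diagnosis fields, thresholds and actions are hypothetical).

from dataclasses import dataclass

@dataclass
class Diagnosis:
    root_cause: str    # e.g. "service_hung"
    component: str     # e.g. "app-server-3"
    confidence: float  # 0..1, as reported by the correlation engine

# Remediation policies keyed by root cause; only simple, reversible actions
# are automated, and only when the diagnosis confidence is high enough.
POLICIES = {
    "service_hung": {"action": "restart_service",  "min_confidence": 0.9},
    "disk_full":    {"action": "purge_temp_files", "min_confidence": 0.8},
}

def remediate(diag: Diagnosis) -> str:
    policy = POLICIES.get(diag.root_cause)
    if policy and diag.confidence >= policy["min_confidence"]:
        return f"AUTO: {policy['action']} on {diag.component}"
    return f"MANUAL: open incident for {diag.component} ({diag.root_cause})"

if __name__ == "__main__":
    print(remediate(Diagnosis("service_hung", "app-server-3", 0.95)))  # automated
    print(remediate(Diagnosis("service_hung", "app-server-3", 0.60)))  # escalated
```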

Automated remediation can be a more significant driver of cost savings than simple data collection, but it requires us to make sense of events before we can effectively achieve this goal.

The Future of Root-Cause Analysis & Event Correlation

There’s no question that, with the emergence of new technologies such as containers, microservices, IoT and big data, the monitoring world will need to continue to keep pace with complexity.

Advances in artificial intelligence and analytics will surely drive continued improvements in monitoring, and we hear a lot about advancements in these areas. But remember, we’ve been down this road before. If you do not understand how the monitor will work, or if it seems too good to be true, be sure to test it in your environment.

As businesses rely ever more heavily on IT services, the need for correlation intelligence that can pinpoint the cause of an issue will only grow in importance over time.

So, if a solution touts the benefits of root-cause analysis yet also hands you a “war room” at the same time, or promises autonomic IT operations without explaining how it will get to an actionable diagnosis, don’t forget…

…the devil’s in the details.

Learn about automated event correlation and root-cause diagnosis in eG Enterprise »

More Stories By John Worthington

John Worthington is the Director of Product Marketing for eG Innovations, a global provider of unified performance monitoring and root-cause diagnosis solutions for virtual, physical and cloud IT infrastructures. He is an IT veteran with more than 30 years of executive experience in delivering positive user experiences through innovative practices in information technology such as ITSM and ITIL.

As CEO and Principal of MyServiceMonitor, LLC, he assisted clients in effectively adapting service lifecycle processes by leveraging eG Enterprise. He then went on to assignments with ThirdSky and VMware, utilizing industry certifications including ITIL Expert and DevOps.

John has more than a decade of experience helping customers transform IT operations to IT-as-a-Service operating models. He participates in industry forums, client and analyst briefings and provides thought leadership to eG Innovations, customers and partners.
