The Devil’s in the Details: Understanding Event Correlation

Correlation and root-cause diagnosis have always been the holy grail of IT performance monitoring. Instead of managing a flood of alerts, IT administrators can use correlation and root-cause diagnosis to determine where the real cause of a problem lies and resolve it quickly, minimizing user impact and business loss.

However, all this is not as simple as it sounds! There has always been confusion around event correlation. Terms like event storms, false alerts, root-cause, correlation, analytics, and others are used by vendors with reckless abandon. The result for customers can be a lot of confusion.

For that reason, I’ve always liked the best-practice guidance around monitoring and event management. Event management’s objectives (detect events, make sense of them and determine the appropriate control action) are a good way to understand these concepts, and breaking things down along these lines helps make sense of what can be a complex subject.

Data Collection and Detecting Events

This is exactly what it sounds like. Monitors need to collect all sorts of data: from devices, applications, services, users and more. This data can be collected in many ways, by many tools and/or components.

So, the first questions you’ll need to ask include “what are we trying to achieve?” and “what data do we need to collect?” Most customers ask these questions based on their individual perspective – a device, a technology silo, a supporting IT service, a customer-facing IT service or perhaps even a single user.

This perspective can result in a desire for a LOT of monitoring data, where the customer is attempting to “cover all bases” by collecting all the data they can get their hands on – “just in case.” Freeware or log monitoring can be simple, cost-effective ways of collecting large amounts of raw data.

But beware of the perception that data collection is where most of your monitoring costs are. In truth, more data is not necessarily a good thing and data collection is not where the real costs to your organization lie, in any case. To determine real causes of performance problems, you’ll have to balance your desire for fast and inexpensive data collection with your needs for making sense of events.

Making Sense of Events with Event Correlation

This is where processing and analyzing the data that you’ve collected occurs, and where you ask questions like “how frequently should we collect data?” “what format does the data need to be in?” and “what analysis do we need to perform?”

For example, do you just require information about what’s happening right now, or will you require history (e.g. to identify trends, etc.)? How granular does the collected data need to be, meaning, what are you going to do with it (e.g. identify a remediation action, etc.)? Putting some thought into your monitoring objectives is an important element of determining a data collection strategy.

Most monitoring products today provide some level of formatting and reporting of monitoring data in the form of charts and graphs. This is where terms such as “root-cause analysis” or “correlation” are often used. The key question here is who is doing the analysis and how. Relying on highly skilled experts to interpret and analyze monitoring data is where the real costs of monitoring come from, and this is where the confusion really begins: event correlation, or making sense of events.

Approaches to Event Correlation

Most customers assume that when they hear “root-cause analysis” or “correlation” there is some level of automation occurring, but relying on your IT staff to interpret log files or graphs is, clearly, manual analysis or correlation. Manual correlation is time-consuming, labor-intensive, requires expertise, and is not scalable as your infrastructure grows. Herein lies the need for monitoring tools that automate this process.

There are many common approaches to correlation:

Rule-Based Correlation
A common and traditional approach to event correlation is rule-based, circuit-based or network-based. These forms of correlation involve defining how the events themselves should be analyzed, and a rule-base is built for each combination of events. The early days of network management made use of many of these solutions. As IT infrastructures have evolved, the amount of data collected and the effort required to build rules that account for every possible event combination make this approach very cumbersome. The challenge with this approach is that you must maintain the rule-base, and with the dynamic nature of today’s environments this is becoming increasingly difficult.
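
To make the maintenance burden concrete, here is a minimal sketch of rule-based correlation in Python. It is an illustration only, not how any particular product works: the Event type, the event kinds and the two rules are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str  # component that raised the event (hypothetical names)
    kind: str    # event type, e.g. "link_down"


# Hypothetical rule base: each rule maps one combination of event kinds
# to a diagnosis. Every new combination needs its own hand-written entry.
RULES = [
    (frozenset({"link_down", "host_unreachable"}), "Network link failure"),
    (frozenset({"disk_full", "db_write_error"}), "Database volume out of space"),
]


def correlate(events):
    """Return the diagnosis of every rule whose event kinds are all present."""
    kinds = {e.kind for e in events}
    return [diagnosis for required, diagnosis in RULES if required <= kinds]


storm = [
    Event("switch-1", "link_down"),
    Event("web-1", "host_unreachable"),
    Event("app-1", "high_cpu"),
]
print(correlate(storm))
```

Any combination that nobody has written a rule for, such as the high_cpu event on its own, produces no diagnosis at all, which is why the rule-base has to grow and change along with the environment.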

History-Based Correlation
Another approach is to learn from past events. Terms like “machine learning” and “analytics” have been used for this approach. What’s common is learning behavior from past events, and if these patterns re-occur you can quickly isolate where the issue was the last time it was observed.

These approaches are independent of the technology domain, so no domain knowledge is needed. This may limit the level of detail that can be obtained, and if the patterns have not been learned from experience, then no correlation will take place. The drawback, of course, is that when problems occur in the software layers, many of the event patterns are new. Furthermore, the dynamic nature of today’s environments makes it less likely that these problem patterns will reoccur.
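
A rough sketch of the idea, under heavy simplifying assumptions: past incidents are stored as sets of event kinds with the cause that was eventually found, and a new set of events is matched against them using Jaccard similarity. This simple similarity match stands in for the far more sophisticated learning a real analytics product would use; the IncidentHistory class, the event names and the 0.6 threshold are all hypothetical.

```python
def jaccard(a, b):
    """Similarity of two sets: size of the overlap divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class IncidentHistory:
    """Remembers past event patterns and the cause found for each one."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold  # assumed cut-off for "similar enough"
        self.past = []              # list of (frozenset of event kinds, cause)

    def learn(self, event_kinds, cause):
        self.past.append((frozenset(event_kinds), cause))

    def match(self, event_kinds):
        """Return the cause of the most similar past pattern, or None."""
        best_cause, best_score = None, 0.0
        for pattern, cause in self.past:
            score = jaccard(pattern, event_kinds)
            if score > best_score:
                best_cause, best_score = cause, score
        return best_cause if best_score >= self.threshold else None


history = IncidentHistory()
history.learn({"slow_response", "high_cpu", "gc_pause"}, "JVM heap exhaustion")

print(history.match({"slow_response", "high_cpu"}))  # close to a known pattern
print(history.match({"packet_loss"}))                # never seen before
```

The second lookup returns nothing, which is the limitation described above: a pattern that has not been seen before cannot be correlated.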

Domain-Based Correlation
These approaches use terms like “embedded correlation.” This approach does not use rules per se, but organizes the measurement data using layered and topology-based dependencies in a time-based model. This enables the monitored data to be analyzed based on dependencies and timing of the events, so the accuracy of the correlation improves as new data is obtained.

The advantage of this approach is that users can get very specific, granular, actionable information from the correlated data without having to maintain rule bases or rely on history. And since virtual and cloud infrastructures are dynamic, the dependencies (e.g. between VMs and physical machines) are dynamic. So the auto-correlating tool must be able to detect these dynamic dependencies to use them for actionable root-cause diagnosis.
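
The following is a minimal sketch of dependency-driven correlation, assuming the tool already knows which component depends on which. It treats an alerting component as a symptom if anything it depends on, directly or indirectly, is also alerting within the same time window, and reports the rest as probable root causes. The component names, the depends_on map and the 60-second window are assumptions made for illustration, not a description of any vendor's engine.

```python
def _has_alerting_dependency(component, depends_on, active):
    """True if any direct or indirect dependency of `component` is alerting."""
    stack = list(depends_on.get(component, ()))
    seen = {component}
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        if dep in active:
            return True
        stack.extend(depends_on.get(dep, ()))
    return False


def root_causes(depends_on, alerts, window_seconds=60):
    """alerts maps component -> alert time in seconds.

    Only alerts within `window_seconds` of the most recent alert are
    considered. Returns the alerting components with no alerting dependency.
    """
    if not alerts:
        return []
    latest = max(alerts.values())
    active = {c for c, t in alerts.items() if latest - t <= window_seconds}
    return sorted(
        c for c in active if not _has_alerting_dependency(c, depends_on, active)
    )


# Hypothetical layered topology: web tier -> app tier -> VM -> physical host.
depends_on = {"web": ["app"], "app": ["vm1"], "vm1": ["host_a"]}
alerts = {"web": 100, "app": 98, "host_a": 95}
print(root_causes(depends_on, alerts))

# The VM migrates to another host, so the dependency map must be refreshed.
depends_on["vm1"] = ["host_b"]
print(root_causes(depends_on, alerts))
```

With the original topology only host_a is reported. After the migration the same alerts point to both app and host_a, since the application no longer depends on the failing host. A stale dependency map would have hidden that.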

The Devil’s Often in the Details
How far should root-cause diagnosis go? This depends on the individual seeking to use the monitoring tool. For some, knowing that the cause of slowness is the high CPU usage of a Java application may be sufficient; they can simply pass the problem on to a Java expert to investigate. On the other hand, the Java expert may want to know which thread and which line of code within the application is causing the issue. This level of diagnosis is desired in real-time, but often, the experts may not be at hand when a problem surfaces. Therefore, having the ability for the monitoring tool to go back in time and present the same level of detail for root-cause diagnosis is equally important.

The level of detail can be the difference between an actionable event and one that requires a skilled IT person to further investigate.

Determine the Appropriate Control Action (Automated IT Operations)

This is where some organizations are focusing, sometimes with limited-to-no evaluation of the monitoring environment. Many IT operations tasks can be automated with limited concern for monitoring, such as automating the provisioning process or request fulfillment.

But if your goal is to automate remediation actions when issues arise, this will involve monitoring. The event management process tends to trigger many processes such as Incident, Problem, Change, Capacity and others. But before you automate remediation tasks, you’ll need to have a high degree of certainty that you’ve correctly identified the root-cause.

The level of detailed diagnostics is relevant here, since without specific detail you may only be able to automate very simple remediation actions (re-start a server, etc.).

As you begin to populate operational remediation policies (also sometimes called rules), you will need to ensure that you can effectively maintain these policies. Therefore, rule-based correlation approaches can come with risk. Failure to maintain the correlation rules can render the policy rules obsolete. Solutions that can correlate to a high degree of accuracy, as well as eliminate or simplify correlation maintenance, can be an advantage here.
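
As a rough sketch of what gating automation on diagnostic certainty might look like, the example below only triggers an automated action when a remediation policy exists for the diagnosed cause and the diagnosis is confident enough; everything else is escalated to a person. The Diagnosis fields, the policy table, the action names and the 0.9 threshold are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    component: str     # where the problem was pinpointed
    cause: str         # diagnosed cause (hypothetical labels)
    confidence: float  # 0.0 to 1.0, as reported by the correlation step


# Hypothetical remediation policies: diagnosed cause -> automated action.
POLICIES = {
    "service_hung": "restart_service",
    "disk_full": "purge_temp_files",
}


def choose_action(diagnosis, min_confidence=0.9):
    """Automate only when a policy exists and the diagnosis is confident enough."""
    action = POLICIES.get(diagnosis.cause)
    if action is None or diagnosis.confidence < min_confidence:
        return ("escalate", diagnosis.component)
    return (action, diagnosis.component)


print(choose_action(Diagnosis("web-1", "service_hung", 0.97)))
print(choose_action(Diagnosis("web-1", "service_hung", 0.55)))
print(choose_action(Diagnosis("app-1", "thread_deadlock", 0.99)))
```

The third case is escalated even though the diagnosis is confident, because nobody has written a policy for that cause. The policy table is one more thing that has to be maintained alongside whatever produces the diagnosis.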

Automated remediation can be a more significant driver of cost savings than simple data collection, but requires us to make sense of events before effectively achieving this goal.

The Future of Root-Cause Analysis & Event Correlation

There’s no question that, with the emergence of new technologies such as containers, microservices, IoT and big data, the monitoring world will need to keep pace with growing complexity.

Advances in artificial intelligence and analytics will surely drive continued improvements in monitoring, and we hear a lot about advancements in these areas. But remember, we’ve been down this road before. If you do not understand how the monitor will work, or if it seems too good to be true, be sure to test it in your environment.

As the business relies more heavily on IT services, correlation intelligence that can pinpoint the cause of an issue is likely to grow in importance over time.

So, if a solution touts the benefits of root-cause analysis and also provides you with a “war room” at the same time, or promises autonomic IT operations without explaining how it will get to an actionable diagnosis, don’t forget…

…the devil’s in the details.

The post The Devil’s in the Details: Understanding Event Correlation appeared first on eG Innovations.

More Stories By John Worthington

John Worthington is the Director of Product Marketing for eG Innovations, a global provider of unified performance monitoring and root-cause diagnosis solutions for virtual, physical and cloud IT infrastructures. He is an IT veteran with more than 30 years of executive experience in delivering positive user experiences through innovative practices in information technology such as ITSM and ITIL.

As CEO and Principal of MyServiceMonitor, LLC, he assisted clients in effectively adapting service lifecycle processes by leveraging eG Enterprise. He then went on to assignments with ThirdSky and VMware, utilizing industry certifications including ITIL Expert and DevOps.

John has more than a decade of experience helping customers transform IT operations to IT-as-a-Service operating models. He participates in industry forums, client and analyst briefings and provides thought leadership to eG Innovations, customers and partners.
