By Elizabeth A. Nichols, Ph.D.
February 10, 2014 09:00 AM EST
As Netuitive's Chief Data Scientist, I am fortunate to work closely with some of the world's largest banks, telcos, and eCommerce companies. Increasingly, the executives I speak with at these companies are no longer focused on just detecting application performance anomalies - they want to understand the impact those anomalies have on the business. For example: "Is the current slowdown in the payment service impacting sales?"
You can think of it as detecting IT operations anomalies that really matter - but this is easier said than done.
Like Needles in a Haystack
When it comes to IT analytics, there is a general notion that the more monitoring data you are able to consume, analyze, and correlate, the more accurate your results will be. Just pile all that infrastructure, application performance, and business metric data together and good things are bound to happen, right?
Larger organizations typically have access to voluminous data being generated from dozens of monitoring tools that are tracking thousands of infrastructure and application components. At the same time, these companies often track hundreds of business metrics using a totally different set of tools.
The problem is that, collectively, these monitoring tools do not communicate with each other. Not only is it hard to get holistic visibility into the performance and health of a particular business service, it's even harder to discover complex anomalies that have business impact.
Anomalies are Like Snowflakes
Compounding the challenge is the fact that no two anomalies are alike. Anomalies that matter have multiple facets. They reflect a composite behavior of many layers of interacting and inter-dependent components. Additionally, they can be cleverly disguised or hidden in a haze of visible but insignificant noise. No matter how many graphs and charts you display on the largest LCD monitor you can find - the type of scalable real-time analysis required to find and expose what's important is humanly impossible.
Enter IT Operations Analytics
Analytics such as statistical machine learning allow us to understand the "normal" behavior of each resource we are tracking - be it a single IT component, web service, application, or business process. Additional algorithms help us find patterns and correlations between the thousands of IT and business metrics that matter in a critical service.
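To make the idea of learning "normal" behavior concrete, here is a minimal sketch in Python. It is not Netuitive's algorithm; it assumes a simple trailing-window z-score as the baseline and a Pearson coefficient as the correlation measure. The names find_anomalies, pearson, latency_ms, and sales_per_min, along with the sample data, are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def find_anomalies(values, window=20, threshold=3.0):
    """Return indices whose value is far from the trailing-window baseline.

    Assumption: "normal" is the mean of the previous `window` samples, and
    "far" means more than `threshold` sample standard deviations away.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            continue  # a flat baseline gives no scale to judge deviation
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def pearson(xs, ys):
    """Pearson correlation between two equal-length metric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


# Hypothetical payment-service latency with one injected spike at index 30.
latency_ms = [100 + (i % 5) for i in range(40)]
latency_ms[30] = 250
print(find_anomalies(latency_ms))  # [30]

# Hypothetical business metric that falls as latency rises.
sales_per_min = [50, 45, 40, 35]
print(pearson([100, 120, 140, 160], sales_per_min))  # close to -1.0
```

A strongly negative coefficient between latency and sales is the kind of signal that would suggest the slowdown matters to the business, although correlation alone does not establish cause.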
The Shift Towards IT Operations Analytics is Already Happening
This is not about the future. It's about what companies are doing today.
Several years ago, thought-leading enterprises (primarily large banks with critical revenue-driving services) began experimenting with a new breed of IT analytics platform. These companies' electronic and web-facing businesses had so much revenue (and reputation) at stake that they needed to find the anomalies that matter -- the ones that were truly indicative of current or impending problems.
Starting with an almost "blank slate", these forward-thinking companies began developing open IT analytics platforms that easily integrated any type of data source in real time to provide a comprehensive view of patterns and relationships between IT infrastructure and business service performance. This was only possible with technologies that leveraged sophisticated data integration, knowledge modeling, and analytics to discover and capture the unique behavior of complex business services. Anything less would fail, because, like snowflakes, no two anomalies are alike.
The Continuous Need for Algorithm Research
The online banking system at one bank is different from the online system at the next bank. And the transaction slowdown that occurred last week may have a totally different root cause than the one that occurred two months ago. Even more interesting are external factors such as seasonality and its effect on demand. For example, payment companies see increased workload around holidays such as Thanksgiving and Mother's Day, whereas demand at gaming/betting companies is driven more by events such as the NFL Playoffs or the World Series.
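One simple way to account for seasonality is to keep a separate baseline per seasonal bucket, so that holiday traffic is judged against past holidays rather than against an ordinary weekday. The sketch below assumes that approach; the bucket labels, the function names seasonal_baselines and is_anomalous, and the transaction counts are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def seasonal_baselines(history):
    """Build {bucket: (mean, stdev)} from (bucket, value) observations.

    Buckets with fewer than two observations are skipped because a sample
    standard deviation cannot be computed for them.
    """
    grouped = defaultdict(list)
    for bucket, value in history:
        grouped[bucket].append(value)
    return {b: (mean(v), stdev(v)) for b, v in grouped.items() if len(v) >= 2}


def is_anomalous(baselines, bucket, value, threshold=3.0):
    """Compare a value against the baseline of its own seasonal bucket."""
    if bucket not in baselines:
        return False  # no history for this bucket, so nothing to judge against
    mu, sigma = baselines[bucket]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical transactions per hour for a payment service.
history = (
    [("weekday", v) for v in (980, 1010, 1000, 1020, 990)]
    + [("holiday", v) for v in (4900, 5100, 5000, 5050, 4950)]
)
baselines = seasonal_baselines(history)

print(is_anomalous(baselines, "holiday", 5000))  # False: normal for a holiday
print(is_anomalous(baselines, "weekday", 5000))  # True: far above a weekday
```

The same workload is normal in one bucket and anomalous in another, which is why a single global threshold tends to produce either noise or missed problems.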
For this reason, analytics research is an ongoing endeavor at Netuitive, driven in part by customer needs and in part by advances in technology. Once Netuitive technology is installed in an enterprise and is integrating data collected across multiple layers of the service stack, behavior learning begins immediately. As time passes, the statistical algorithms have more observations to draw on, which leads to increasing confidence in both the anomalies detected and the proactive forecasts. Additionally, customer domain knowledge can be layered into Netuitive's real-time analysis in the form of knowledge bases and supervised learning algorithms. The Research Group at Netuitive works closely with our Professional Services Group, as well as directly with customers, to regularly review the quality of the alarms actually delivered. We use these reviews to tune our existing algorithms and to identify new algorithms that would deliver greater value in an actionable timeframe.
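The link between more observations and more confidence can be illustrated with a running baseline whose standard error shrinks as samples arrive. This is a minimal sketch, assuming Welford's online algorithm for the mean and variance; the class name RunningBaseline and the sample values are hypothetical, and the source does not say which statistics Netuitive uses.

```python
from math import sqrt


class RunningBaseline:
    """Incrementally learned mean and spread of one metric (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def stdev(self):
        """Sample standard deviation, or 0.0 with fewer than two samples."""
        return sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

    @property
    def standard_error(self):
        """Uncertainty of the learned mean; infinite until two samples exist."""
        return self.stdev / sqrt(self.n) if self.n > 1 else float("inf")


samples = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical metric readings

baseline = RunningBaseline()
for x in samples:
    baseline.update(x)
early_error = baseline.standard_error

# Feed three more passes of the same readings to mimic a longer history.
for x in samples * 3:
    baseline.update(x)
later_error = baseline.standard_error

print(early_error > later_error)  # True: more observations, tighter estimate
```

Because the update is incremental, learning can start with the first sample and no history needs to be stored, at the cost of treating old and new observations equally.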
Since Netuitive's software architecture allows for "pluggable" algorithms, we can incrementally introduce new analytics capabilities easily, at first in an experimental or laboratory setting and ultimately, once verified, into production.
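A "pluggable" design of this kind might look something like the registry below, where each algorithm is registered under a name and a stage, and only promoted algorithms run in production. This is a sketch under my own assumptions, not a description of Netuitive's architecture; DetectorRegistry, the stage labels, and both example detectors are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# A detector takes a metric series and returns the indices it flags.
Detector = Callable[[Sequence[float]], List[int]]


class DetectorRegistry:
    """Holds named detectors, each tagged 'experimental' or 'production'."""

    def __init__(self) -> None:
        self._detectors: Dict[str, Detector] = {}
        self._stage: Dict[str, str] = {}

    def register(self, name: str, stage: str = "experimental"):
        """Decorator that plugs a detector in under the given stage."""
        def decorator(func: Detector) -> Detector:
            self._detectors[name] = func
            self._stage[name] = stage
            return func
        return decorator

    def promote(self, name: str) -> None:
        """Move a verified detector into production."""
        if name not in self._detectors:
            raise KeyError(name)
        self._stage[name] = "production"

    def run(self, values: Sequence[float], stage: str = "production") -> Dict[str, List[int]]:
        """Run every detector in the given stage and collect its results."""
        return {
            name: detector(values)
            for name, detector in self._detectors.items()
            if self._stage[name] == stage
        }


registry = DetectorRegistry()


@registry.register("static_threshold", stage="production")
def static_threshold(values):
    return [i for i, v in enumerate(values) if v > 200]


@registry.register("sudden_jump")  # starts out experimental
def sudden_jump(values):
    return [i for i in range(1, len(values)) if values[i] - values[i - 1] > 50]


data = [100, 110, 260, 120]
print(registry.run(data))                        # {'static_threshold': [2]}
print(registry.run(data, stage="experimental"))  # {'sudden_jump': [2]}
```

Calling registry.promote("sudden_jump") would make the new detector part of the default production run without touching the code that consumes the results.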
The IT operations management market has matured over the past two decades to the point that most critical components are well instrumented. The data is there, and mainstream IT organizations (not just visionary early adopters) realize that analytics deliver measurable and tangible value. My vision and challenge is to get our platform to the point where customers can easily customize the algorithms on their own as their needs and IT infrastructure evolve over time. This is where platforms need to go, because of the endless variety of ways that enterprises must discover and remediate "anomalies that matter".
Stay tuned. In an upcoming blog post, I will drill down into some specific industry examples of algorithms we developed as part of large enterprise IT analytics platform solutions.