From Scala Unified Logging to Full System Observability
Part 3 of 3: Metrics and the Bigger Picture

Jonathan is a platform engineer at VictorOps, responsible for system scalability and performance. This is Part 3 in a series on system visibility, the Detection and Analysis part of the Incident Management Lifecycle. If you missed them, read Part 1 and Part 2 first.

OK, the logging improvements are great for our team and codebase(!!), but there’s a bigger world out there for monitoring and instrumentation. We’re talking about observability here. It’s a world where we not only have the necessary information for troubleshooting and debugging, but also a detailed understanding of how the running system is performing, with alerts for when things go wrong.

Enter metrics. Metrics help form a well-rounded approach to the monitoring and instrumentation of your systems. They focus on quantitative measurements of success, responsiveness, demand, saturation, etc. If log statements are for debugging your running system, then metrics are for picturing how the system is running and indicating when the system needs attention. Metrics are the system’s heartbeat.

An analogy

When I think of instrumentation, I think of a person (the “system”) with advanced hypertension, undergoing angioplasty to open several clogged blood vessels. After a stressful, harrowing experience in critical care, the team performs the surgery. Fortunately it’s successful, but the patient is weak and has a few weeks’ recovery ahead.

But what if preventative health visits and monitoring (system metrics) had detected his high blood pressure earlier? And what if those earlier results were accompanied by a new diet, exercise, and medication plan? Maybe this patient wouldn’t have needed risky, reactive surgery at all.

Obviously this isn’t a direct correlation, but it’s a useful analogy for thinking about the visibility that we have into our own running systems. Do we know the request latency and error rate? The demand for and saturation of given components?

The case for metrics

Now that we’re completing the big picture of monitoring and instrumentation using multiple methods, let’s examine how metrics complement a good log portfolio.

Logs are strings. Strings need to be interpreted. Interpretation typically requires humans. Humans are slow to respond.

Metrics are numbers. Numbers are immediately actionable. Actions can even be taken automatically.

Metrics have metadata. Metadata describes multiple dimensions. Dimensionality exposes additional trends, outliers, and correlations.
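To make the dimensionality point concrete, here is a minimal Scala sketch. Everything in it (the Sample type, the metric name, the host and endpoint tags, and the numbers) is a made-up assumption for illustration, not our actual metrics model. It only shows how the same set of numbers tells a different story depending on which dimension you slice by.

```scala
// Hypothetical model: a metric sample is a number plus descriptive tags.
final case class Sample(name: String, value: Double, tags: Map[String, String])

object Dimensions {
  // Average one metric per distinct value of a single tag (dimension).
  // Samples that lack the tag are ignored.
  def meanBy(samples: Seq[Sample], metric: String, tag: String): Map[String, Double] =
    samples
      .filter(s => s.name == metric && s.tags.contains(tag))
      .groupBy(_.tags(tag))
      .map { case (tagValue, group) => tagValue -> group.map(_.value).sum / group.size }

  // Invented data: web-2 is slow regardless of endpoint.
  val samples: Seq[Sample] = Seq(
    Sample("request_latency_ms", 40.0, Map("host" -> "web-1", "endpoint" -> "/alerts")),
    Sample("request_latency_ms", 60.0, Map("host" -> "web-1", "endpoint" -> "/users")),
    Sample("request_latency_ms", 450.0, Map("host" -> "web-2", "endpoint" -> "/alerts")),
    Sample("request_latency_ms", 550.0, Map("host" -> "web-2", "endpoint" -> "/users"))
  )
}
```

Sliced by endpoint, both averages look similarly mediocre. Sliced by host, the outlier is obvious. That is the kind of trend a flat log line or an untagged number would hide.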

How does VictorOps approach metrics?

We are still iterating on our metrics infrastructure and portfolio, and through that process we are identifying more KPIs (Key Performance Indicators) for our systems. For quite a while we’ve employed black-box style metrics, a.k.a. “show me what the customer is experiencing.”

Now we’re trying to answer deeper questions, like: “Now that I know that error rate is too high, which Service Level Indicators (SLIs) would help paint a more complete picture of my problem?” Our next step is to explore the white-box SLIs that answer these types of internal system questions. Overall, we want a mixture of black-box and white-box metrics, where the former alert on customer experience issues and the latter point us to a subset of the system for troubleshooting and auto-remediation efforts.

How do we use metrics?

The goal is to auto-remediate issues that can be safely addressed without human involvement and alert on all remaining issues. So, when instrumentation points to a known problem, that problem is auto-remediated by some script or application and people don’t need to get involved in resolving the problem. This gives us response times well within seconds, and no human could both respond to and remediate the issue in a comparable timespan. For the remainder of scenarios, where human intervention is truly necessary, we want to alert on the problem as early as possible and, ideally, even before it becomes a problem.
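As a rough illustration of that split, the sketch below routes a detected problem either to an automated fix or to a human. The problem names, runbook names, and types are all hypothetical. It is a sketch of the idea under those assumptions, not a description of our production tooling.

```scala
// Two possible outcomes for a detected problem (hypothetical types).
sealed trait Response
final case class AutoRemediate(runbook: String) extends Response
final case class AlertHuman(summary: String) extends Response

final case class Finding(problem: String, component: String)

object Triage {
  // Assumed catalogue of known problems that are safe to fix without a person.
  val safeRunbooks: Map[String, String] = Map(
    "disk_nearly_full" -> "rotate-and-compress-logs",
    "worker_unresponsive" -> "restart-worker"
  )

  // Known, safe problems get a runbook; everything else pages a human.
  def respond(finding: Finding): Response =
    safeRunbooks.get(finding.problem) match {
      case Some(runbook) => AutoRemediate(runbook)
      case None          => AlertHuman(s"${finding.problem} on ${finding.component}")
    }
}
```

The design choice worth noting is the default: anything not explicitly listed as safe falls through to an alert, so a new failure mode never gets silently "fixed" by a script that was written for something else.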

Where do dashboards fit?

Alerts and auto-remediation are great, but it’s also helpful to have a visualization of your running system. Well-built, intuitive dashboards are great tools for a first responder to troubleshoot and pinpoint a problem by scanning for anomalies. Specifically, we’re talking about dashboards that make anomalies easy to identify without needing a full set of tribal knowledge or a PhD. So, please, be careful what you do with all of that valuable insight into your system. Not all of your metrics must be associated with an alert (or you’ll soon end up with alert fatigue from too much noise). You should aim for intelligent alert thresholds and include them in your dashboards.

Be careful, however, as dashboards require a human’s attention in order to be useful…just like logs (at least ones without alerts). It’s both a waste of your time and highly error-prone to assume that each person or shift staring at that dashboard will know how to accurately interpret each panel and every time series displayed. This means that care must be taken to not only measure but clearly visualize and indicate the value of your metrics, both via dashboards and alerts.

Don’t just measure ALL the things

Ideally, we want to build out preventative health care that’s not just packed full of instrumentation, but is full of useful indicators of system health: the KPIs. If a metric isn’t used by the team through alerts or dashboards, it’s likely not useful, and it will likely become a metric that you have to clean up later. These metrics should be removed, and we should be cognizant (but not fearful) of new metrics falling into the same bucket.
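One way to find cleanup candidates is to compare what you emit against what your dashboards and alerts actually reference. The sketch below assumes you can export those three lists from your own tooling. The function name and the shape of the inputs are hypothetical.

```scala
object MetricAudit {
  // Metrics that are emitted but referenced by no dashboard and no alert rule.
  // `dashboards` and `alerts` map a dashboard or rule name to the metrics it uses.
  def unused(
      emitted: Set[String],
      dashboards: Map[String, Set[String]],
      alerts: Map[String, Set[String]]
  ): Set[String] = {
    val referenced = dashboards.values.flatten.toSet ++ alerts.values.flatten.toSet
    emitted -- referenced
  }
}
```

Anything this returns is not automatically garbage, but it is a good prompt for the question "who is this metric for?"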

Craving more?

Want to kick it up a few notches? Look into systems that can perform anomaly detection for your teams. This allows you to focus on more complex problems, build auto-remediation, fix newly identified issues, etc.
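If you want a feel for the basic idea before evaluating a real product, here is a deliberately naive sketch: flag a new value when it sits more than a few standard deviations from the recent mean (a z-score check). This is my own simplification, not how any particular anomaly detection system works. Real systems also have to handle seasonality, trends, and noisy baselines. The function name and the default threshold of 3.0 are assumptions.

```scala
object Anomaly {
  // Naive z-score check: is `latest` more than `threshold` standard
  // deviations away from the mean of `history`?
  def isAnomalous(history: Seq[Double], latest: Double, threshold: Double = 3.0): Boolean =
    if (history.size < 2) false // not enough data to judge
    else {
      val mean = history.sum / history.size
      val variance = history.map(x => math.pow(x - mean, 2)).sum / history.size
      val stdDev = math.sqrt(variance)
      // A perfectly flat history has no spread, so any change counts as anomalous.
      if (stdDev == 0.0) latest != mean
      else math.abs(latest - mean) / stdDev > threshold
    }
}
```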

Want to move into uncharted territory? Give sonification a try, or at least check out this cool podcast from Science Friday: Listening in on scientific data. “When it comes to analyzing scientific data, there are the old standbys, plots and graphs. But what if instead of poring over visuals, scientists could listen to their data—and make new discoveries with their ears?”

Although we won’t dive into the topics of distributed tracing and eventing, it’s important to mention how they can add additional viewports into your system. Tracing breaks down the request path through your system by describing both the path and the timing at each stage. Eventing allows you to perform offline analysis of events that occurred in your system, with full detail. This is similar to logs but intended for computational processing and not human processing.
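For a sense of what "path and timing at each stage" means in data terms, here is a tiny sketch. The Span type, the stage names, and the timings are invented for illustration, and it assumes simple sequential stages rather than the nested, cross-service spans a real tracing system records.

```scala
// Hypothetical record of one stage of a request: what ran, and when.
final case class Span(stage: String, startMs: Long, endMs: Long) {
  def durationMs: Long = endMs - startMs
}

object Trace {
  // The request path: stages in the order they started.
  def path(spans: Seq[Span]): Seq[String] =
    spans.sortBy(_.startMs).map(_.stage)

  // The stage that took the longest, if there are any spans at all.
  def slowest(spans: Seq[Span]): Option[Span] =
    if (spans.isEmpty) None else Some(spans.maxBy(_.durationMs))

  // Invented timings for a single request.
  val example: Seq[Span] = Seq(
    Span("load-balancer", 0L, 2L),
    Span("api", 2L, 10L),
    Span("database", 10L, 95L),
    Span("render", 95L, 100L)
  )
}
```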

Wrap it up!

I’ve covered how logs and metrics play huge parts in adding observability to your systems, with tracing and eventing also fitting into the big picture. Logs provide you with detailed diagnostic information and sometimes a clear enough indication of a problem that you can create alerts off them. Metrics provide you with quantitative insights into the running system and are also an ideal source of alerts. Remember: don’t take on the hard work of adding observability, logs, dashboards, alerts, and anomaly detection to your platform without making these things truly useful or actionable.

Finally, don’t forget to pay attention to those annoying and often below-the-radar gaps in your observability portfolio, like the logging issue we had in our backend systems at VictorOps. You might find that alleviating these annoyances leads to happier, more enthusiastic teams with a renewed stake in creating better systems.

The post From Scala Unified Logging to Full System Observability, Part 3 of 3: Metrics and the Bigger Picture appeared first on VictorOps.
