
From Scala Unified Logging to Full System Observability
Part 1 of 3: Our Original State of Logging

Jonathan is a platform engineer at VictorOps, responsible for system scalability and performance. This is Part 1 in a 3-part series on system visibility, the detection part of incident management.

These days, with infrastructures spanning tens, hundreds, even thousands of running instances, piping a log file into less is no longer an acceptable means of log research and debugging. Instead, sending logs to an aggregation service such as Sumo Logic, Elastic, or Splunk is commonplace, because searchability is king.

Unfortunately, the pursuit of searchability can lead to undesirable side effects: unreadable, inconsistent, and just plain ugly log statements. It invades our codebases with even more custom formatting (on top of string interpolation, etc.) that's not only distracting but also hard to do well without superhuman string-formatting skills. In short, the effect on log statements can be detrimental.

Logging is just the starting point

First off, let's bring a little context to this pursuit. At VictorOps, we use logging as a research and debugging tool. However, logging isn't, and shouldn't be, the primary heartbeat of your systems. That's where metrics come into the picture, which we'll discuss in Part 3 of this series. Before going there, let's talk about where we started with logging at VictorOps. We needed some major improvements in this first line of troubleshooting for when things go bad.

As Dave Hahn, a senior SRE from Netflix, recently shared with us, “Be willing to have a problem before you solve it.” In line with that advice, we recently identified multiple problems to solve relating to the research and debugging done through our logs. To top it all off, when I noticed that our logging interfaces were not unified, it became clear that it was time to make both our logging interface and log output great again.

I hope that our experience at VictorOps will give you ideas on how to improve logging at your organization.

The current state: how we use logs at VictorOps

Sumo Logic is our logging platform and we use it heavily throughout the development lifecycle.

There are four primary ways we use logs:

  1. To get visibility into what's going on during releases. Through logs, we can see if there are errors that persist after a release. If so, there is probably a hole in our alerting – some problem that we aren't yet monitoring.
  2. To create VictorOps incidents for relevant alerts. When we know that a particular log statement indicates a problem where someone needs to get involved, we hook a scheduled Sumo search up to the VictorOps platform to create an incident out of it. The goal for most of these alerts is to migrate them to a metric-based alert instead of basing it on a log statement – more on that in our metrics discussion in Part 3 of this series.
  3. To see how something is working in production. We might want to see how some new feature is working in production, so we’ll review the log statements. The production environment is always the most valuable for feedback because that’s where real customers have real accounts, alerts, users, and escalation policies. It may look fine in staging, but if there is a use case we didn’t test for that shows up in production, you can see the details in a log.
  4. To investigate high-dimensionality information. Organization, user, and API key (and, for that matter, any sort of UUID) are all great examples of metadata that typically won't be available in a metric, so logging (or eventing) is where we'll find that data (see the sketch after this list).
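For instance, here's a minimal sketch (our own hypothetical statement, not VictorOps code) of a log line that embeds these identifiers in a consistent key=value format so each one is searchable in Sumo later:

```scala
import org.slf4j.LoggerFactory

object AlertRouter {
  private val log = LoggerFactory.getLogger("AlertRouter")

  // orgId, userId, and apiKeyId are hypothetical identifiers; a consistent
  // key=value format makes each one queryable in a log aggregator.
  def route(orgId: String, userId: String, apiKeyId: String): Unit =
    log.info(s"alert routed orgId=$orgId userId=$userId apiKeyId=$apiKeyId")
}
```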

We had three main players in our logging

Our Scala backend used three different logging frameworks. Some code used the SLF4J logging framework, which is widely used and provides a rich feature set. Other code, within Akka actors, used Akka's actor logging, which has a scaled-down interface and feature set and was configured to use SLF4J. Some of our Play code used Play's own logging, which is extremely simplistic and was also configured to use SLF4J. All of these were backed by Logback, SLF4J's native implementation. Here are some details:

SLF4J

SLF4J is likely the most widely used Java logging facade, with multiple implementations and a massive user base. Performance depends solely on how you configure the appender you're using. By default, Logback uses a synchronous appender, but you can easily configure an asynchronous one. A synchronous appender uses the calling thread to actually write the log statement to file/network, whereas an asynchronous appender lightens the load on the calling thread by simply handing the log statement over to the appender to write to file/network at some point in the future.
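As a minimal illustration (the class and message are ours, not from the post), here is how code obtains a logger through the SLF4J facade in Scala; note that nothing at the call site determines whether the write is synchronous or asynchronous – that's decided entirely by the appender configuration:

```scala
import org.slf4j.{Logger, LoggerFactory}

class BillingService {
  // Resolved through the SLF4J facade; Logback (the bound implementation)
  // does the actual writing.
  private val log: Logger = LoggerFactory.getLogger(classOf[BillingService])

  def charge(accountId: String): Unit =
    // Whether this call blocks on file/network I/O depends entirely on
    // whether the configured appender is synchronous (Logback's default)
    // or wrapped in an AsyncAppender.
    log.info("charging account {}", accountId)
}
```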

Akka logging

Akka's actor-based logging is event driven and is easily configured to use SLF4J. In the actor itself, you call log.info("this message"); behind the scenes, it sends an event to the system's event stream and is done. Creating the log statement adds almost no overhead because the actual writing happens somewhere else.
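A minimal sketch (the actor and message are hypothetical) of what that looks like:

```scala
import akka.actor.{Actor, ActorLogging}

class PaymentActor extends Actor with ActorLogging {
  def receive: Receive = {
    case amount: BigDecimal =>
      // log.info publishes an event to the actor system's event stream
      // and returns immediately; the SLF4J-backed handler writes it later.
      log.info("processing payment of {}", amount)
  }
}
```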

Play logging

Play has its own simplistic logger that's much more stripped down than the Akka logger and by default uses SLF4J. Play's interface takes up to two arguments: the string that you're logging and an optional exception. The most recent version (2.6.x) added support for SLF4J markers.
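A minimal sketch (the object and message are ours) of Play's two-argument interface:

```scala
import play.api.Logger

object SignupService {
  // Play's Logger wraps an underlying SLF4J logger.
  private val logger = Logger(this.getClass)

  def handleFailure(e: Throwable): Unit =
    // Both arguments are by-name, so the message string is only built
    // if the error level is actually enabled.
    logger.error("signup failed", e)
}
```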

Why change how we do logging?

Strategy concerns

These concerns have to do with the various strategies taken by these different logging interfaces.

  • Call-site performance: All SLF4J interfaces rely on the caller to provide pre-computed strings and arguments before checking whether that log level (info, debug, trace, etc.) is enabled. There are simple ways around this, like Play's interface, which uses a by-name argument for the string. This essentially creates an anonymous function that is executed only after the log level has been checked (see the sketch after this list). For example, without by-name arguments, the statement below requires the mkString method to execute on a potentially large collection before the info method checks whether info-level log statements are enabled: log.info(s"Team $team has users: ${users.mkString(separator)}")
  • Conflicting interfaces: The largest effect of conflicting interfaces is developer confusion and frustration. The next problem is that it leads to incorrect log statements. If logs are to save you when things go awry, then an incorrect log statement is like a carabiner with a broken gate: it looks like a useful thing but is completely useless for the intended user. For example, below are the error methods from these three interfaces. Notice how the location of the Throwable argument changes? Now, imagine working in a codebase where all three of these interfaces are being used. A little scary.
    • SLF4J: void error(String msg, Throwable t)
    • Akka: def error(cause: Throwable, message: String): Unit
    • Play: def error(message: ⇒ String, error: ⇒ Throwable)
  • Appender performance: All three of these have configurable backends and appenders, but it is worth noting that any interface you use will need to have its configuration examined. Most default appenders are synchronous and therefore write the log statement to file (or whatever destination) at the call-site. However, this can be changed easily by configuring an asynchronous appender. This clearly improves call-site performance by requiring only the string to be built before asynchronously handing it off to the appender, which will write the statement to file on its own time.
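To make the by-name point concrete, here's a minimal sketch (our own wrapper, not any of the three interfaces above) that defers message construction until after the level check:

```scala
import org.slf4j.{Logger, LoggerFactory}

// A thin wrapper in the style of Play's interface: `message` is by-name,
// so the (possibly expensive) string is built only if info is enabled.
class LazyLogger(underlying: Logger) {
  def info(message: => String): Unit =
    if (underlying.isInfoEnabled) underlying.info(message)
}

object TeamReport {
  private val log = new LazyLogger(LoggerFactory.getLogger("TeamReport"))
  private val separator = ", "

  def report(team: String, users: Seq[String]): Unit =
    // users.mkString runs only when info-level logging is enabled.
    log.info(s"Team $team has users: ${users.mkString(separator)}")
}
```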

Developer concerns

How did using multiple logging libraries affect the developers?

  • Too many decisions: Choosing between three different loggers for any given class.
  • Conflicting interfaces: From a developer perspective, this causes confusion and requires you to pay more attention to your logger than you really should.
  • Inconsistency: Having more than one logger in a class (which is clearly unnecessary) and naming logger fields inconsistently, e.g. log versus logger.

Functionality needs

What functionality do the developers need for a maintainable codebase and effective log portfolio?

  • Unified interface: A single interface allows you to add new features in one place and enables the power of easily refactoring logging on a large scale.
  • Support for log variables: Extracting specific information from a log statement is easier if it’s been given special formatting. Once standardized, this can be utilized in our Sumo queries.
  • Implicit loggers for utility classes: Utility classes lack their own identity in terms of data flow. Implicitly passing in a logger, which has identifying information from the caller (its class and log variables), provides rich log statements within utility code.
  • Further consistency: This is icing on the cake: things like a very simple Logging trait to standardize the log field name, the logger name (used when writing the log statements), and the logger identity (based on log variables) – sketched below.
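To give a flavor of these ideas (a sketch under our own assumptions – Part 2 describes what we actually built), a unified Logging trait plus an implicit logger for utility code might look like this:

```scala
import org.slf4j.{Logger, LoggerFactory}

// One trait standardizing the field name (`log`) and the logger name.
trait Logging {
  protected implicit val log: Logger = LoggerFactory.getLogger(getClass)
}

// Utility code receives the caller's logger implicitly, so its output
// carries the caller's identity instead of the utility's.
object RetryUtil {
  def retry[A](attempts: Int)(body: => A)(implicit log: Logger): A =
    try body
    catch {
      case e: Exception if attempts > 1 =>
        log.warn(s"retrying after failure: ${e.getMessage}")
        retry(attempts - 1)(body)
    }
}

class AlertProcessor extends Logging {
  def process(): Unit =
    RetryUtil.retry(3) {
      // work that might fail; any retry warnings are attributed to
      // AlertProcessor because `log` is passed implicitly from Logging
    }
}
```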

Up next

Now that we’ve set the stage, in Part 2, we’ll explore how we addressed these concerns in order to make logging great again.
