SYS-CON MEDIA Authors: Yeshim Deniz, Elizabeth White, Roger Strukhoff, Jason Bloomberg, Pat Romanski


Using Postmortems to Understand Service Reliability

2017 was a year of many major outages—some took down the Internet for hours while others disrupted business workflows and communication at companies large and small. Any way you slice it, these outages likely resulted in a lot of time devoted to postmortems.

I want to reflect a bit on why we write postmortems and suggest some things for authors to think about when writing them. I think there’s room for all of us to improve when it comes to gathering information to better plan proactive fixes before services catch fire.

Why Do We Conduct Postmortems?

Our incident response training docs put it this way: “Effective post-mortem[s] allow us to learn quickly from our mistakes and improve our services and processes for everyone.” The key takeaway for me is that organizations should use postmortems to capture what they learned from an incident. In other words:

  1. Postmortems are an exercise to learn the specifics of why an incident happened and what needs to be done to prevent it from happening again.
  2. Organizations should try to learn how effective their incident response process is and what areas can be improved.

I think these two points are generally what people mean when they talk about “Root Cause Analysis and Causal Factors,” “What Went Well,” and “What Didn’t Go Well” in postmortems.

That’s not what I want to talk about here though.

I think there’s another layer we get out of the postmortem process itself that hasn’t usually been part of the discussion: communicating about your service’s long-term stability.

For example, before one major incident, the postmortems for a series of minor incidents in the same service highlighted nothing of concern. After the major incident was resolved, its postmortem looked at the “Role of Previous Incidents” and found that all identified immediate and P1 follow-ups had been completed or canceled due to changing plans or new information (it’s easy, and okay, to de-prioritize or skip something if it looks like a single occurrence).

Between the minor incidents and the big one, there certainly was work going on with regard to that particular platform, but I don’t think that anyone would say that the service was in good health! The postmortems for the incidents during this period focused on the immediate issues of each incident; they didn’t capture the health of the service as a whole. As humans, we’re bad at remembering things, so it’s important to look at broader trends to see whether there is a recurring issue. I think there’s an opportunity to level up processes by devoting more attention here when writing a postmortem report.
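One way to make the "broader trends" check less dependent on memory is to count incidents per service over a rolling window. The sketch below is only an illustration: the `Incident` record, the 90-day window, the threshold of three, and the sample history are all assumptions I made up for the example, not anything drawn from a real incident tracker.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Incident:
    service: str   # hypothetical service name
    opened: date   # day the incident started
    severity: str  # "minor" / "major" labels are assumptions


def recurring_services(incidents, as_of, window_days=90, threshold=3):
    """Return {service: count} for services with at least `threshold`
    incidents in the `window_days` ending at `as_of`."""
    cutoff = as_of - timedelta(days=window_days)
    counts = Counter(
        i.service for i in incidents if cutoff <= i.opened <= as_of
    )
    return {service: n for service, n in counts.items() if n >= threshold}


# Made-up history: three minor incidents in one service, one in another.
history = [
    Incident("notifications", date(2017, 9, 4), "minor"),
    Incident("notifications", date(2017, 10, 12), "minor"),
    Incident("billing", date(2017, 10, 20), "minor"),
    Incident("notifications", date(2017, 11, 2), "minor"),
]

flagged = recurring_services(history, as_of=date(2017, 11, 30))
print(flagged)
```

A service showing up here doesn't prove it is unhealthy; it is a prompt to ask, in the next postmortem, whether the individual incidents add up to something larger.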

At PagerDuty, we’re service-owning engineering teams, so we have opinions about the ongoing stability of our teams’ services. When a major incident occurs involving a service, it forces us to think about our judgment of the stability, and whether our opinion about the long-term health has changed because of the incident. If it has, we then re-evaluate our plans to determine whether we need to prioritize large-scope work to improve that service. For a postmortem report, the crucially important thing to remember is that the things we choose not to do as action items are as important to capture as the action items we decide to do.

When looking over postmortem action items, we found that they tend to be very fine-grained and tightly scoped: upgrade this library, add this monitor, and so on. The guidance that floats around for action item timelines reinforces this. But it’s also important to communicate beyond that. Needs for large-scoped remedial improvements are much easier to work into teams’ roadmaps when they are spotted early. I think engineering teams, since they’re the people closest to services, often have a lot of internal knowledge and good instincts about the health of services, but don’t always have a good way to share them and to highlight issues that need larger work. Including this information in postmortem reports is an opportunity to be more transparent about these looming vulnerabilities.

The postmortem report is not just for the team conducting it and owning the service—the team prepares the report and conducts the postmortem investigation, but the final report itself is for the whole organization. A good report captures the risks of our current services, and will help Product and Engineering to more proactively prioritize work on services.

Five Questions to Answer During a Postmortem (None of Which Are “Why”)

Someone from outside your team should be able to read your postmortem report and answer these five questions:

  1. How did we view the health of the service involved prior to the incident?
  2. Did this incident teach us something that should change our views about this service’s health?
  3. Was this an isolated and specific bug—a failure in a class of problem we anticipated—or did it uncover a class of issue we did not architecturally anticipate in the service?
  4. Do we think an incident akin to this one will happen again if we don’t take larger systemic action beyond the action items captured here?
  5. Will this class of issue get worse or become more likely as we continue to grow and scale the use of the service?

*Bonus question: Was there a previous incident that showed early signs pointing to this one?

I’d expect these usually to be used as introductory text to the “Action Items” the team intends to take, but sometimes “What Went Well” or “What Didn’t Go Well” will be more appropriate.

Additionally, if there are divergent views within the team preparing the report about the questions, that is also something to capture! Uncertainty is a valuable signal.
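If your postmortem reports live in a structured form, the five questions (plus the bonus question and any dissenting views) could be captured as fields so that gaps are easy to spot. This is a minimal sketch under my own assumptions: the class name, field names, and sample answers are hypothetical and are not part of any existing postmortem template.

```python
from dataclasses import dataclass, field
from typing import List

# Field names for the five required questions (hypothetical names).
REQUIRED = (
    "prior_health_view",                   # Q1
    "changed_health_view",                 # Q2
    "isolated_or_new_class",               # Q3
    "recurrence_without_systemic_action",  # Q4
    "worsens_with_scale",                  # Q5
)


@dataclass
class ServiceHealthAssessment:
    """Free-text answers a reader outside the team should be able to find."""
    prior_health_view: str = ""
    changed_health_view: str = ""
    isolated_or_new_class: str = ""
    recurrence_without_systemic_action: str = ""
    worsens_with_scale: str = ""
    earlier_warning_incident: str = ""  # bonus question, optional
    dissenting_views: List[str] = field(default_factory=list)

    def unanswered(self) -> List[str]:
        """Names of required questions that are still blank."""
        return [name for name in REQUIRED if not getattr(self, name).strip()]


assessment = ServiceHealthAssessment(
    prior_health_view="Considered stable; no large-scope work planned.",
    changed_health_view="Yes, we now think it is fragile under load.",
    isolated_or_new_class="New class: we never planned for this failure mode.",
    dissenting_views=["One engineer thinks it was an isolated bug."],
)
print(assessment.unanswered())
```

Keeping `dissenting_views` as its own field is deliberate: disagreement is recorded next to the answers instead of being averaged away.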

There are also some things to clarify about what we think we are accomplishing with the action items we are taking.

Ask yourselves, are we:

  1. Dealing with a specific issue immediately in a narrow, targeted way?
  2. Taking action to eliminate what we see as an entire class of potential issues?
  3. Not taking action, because larger efforts are already underway and will rapidly obsolete a targeted fix? (If so, those larger efforts should be called out!)
  4. Not taking significant action because we don’t think it’s justified?
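The four options above can also be recorded explicitly next to each decision, so that a choice not to act is captured as deliberately as an action item. The sketch below is one possible shape, not an established schema: the enum members, the `ActionDecision` fields, and the rule that a "not justified" decision needs a written rationale are my assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ActionIntent(Enum):
    TARGETED_FIX = "narrow, immediate fix for the specific issue"
    CLASS_ELIMINATION = "eliminates an entire class of potential issues"
    DEFERRED_TO_LARGER_EFFORT = "no action; a larger effort will obsolete a targeted fix"
    NO_ACTION_NOT_JUSTIFIED = "no significant action; not justified"


@dataclass
class ActionDecision:
    summary: str
    intent: ActionIntent
    larger_effort: Optional[str] = None  # which larger effort, if deferring
    rationale: Optional[str] = None      # why no action, if declining

    def problems(self) -> List[str]:
        """List what is missing from the write-up of this decision."""
        issues = []
        if self.intent is ActionIntent.DEFERRED_TO_LARGER_EFFORT and not self.larger_effort:
            issues.append("name the larger effort that replaces a targeted fix")
        if self.intent is ActionIntent.NO_ACTION_NOT_JUSTIFIED and not self.rationale:
            issues.append("record why action is not justified")
        return issues


# Hypothetical decisions for illustration only.
deferred = ActionDecision(
    summary="Do not patch the retry logic",
    intent=ActionIntent.DEFERRED_TO_LARGER_EFFORT,
)
print(deferred.problems())

fixed = ActionDecision(
    summary="Add a monitor for queue depth",
    intent=ActionIntent.TARGETED_FIX,
)
print(fixed.problems())
```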

Learning more from and communicating better with postmortems will help you improve services and reduce the number and severity of incidents you encounter. We all want fewer major incidents and more sleep, and we can have that if we make sure we’re learning all we can from the incidents we do have.

 


Be sure to check out our Postmortem Handbook in which we share lessons learned from the trenches and how you can conduct better postmortems. Or dive directly into the product and try our streamlined postmortem process where you can create incident reports with a single click. Sign up for a free trial to get started!

The post Using Postmortems to Understand Service Reliability appeared first on PagerDuty.


More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and Github.
