SYS-CON MEDIA Authors: Jason Bloomberg, Elizabeth White, Zakia Bouachraoui, Andy Thurai, Liz McMillan


Our recent Service Outage


We want to apologize for the service outage on Thursday, 7/31, which started at 6:30PM UTC. We caused you a lot of trouble, and we are really sorry!

After digging into our logs, we reconstructed the series of events:

It started with poor database performance around 6:30PM UTC, which resulted in a growing backlog of events in our Sidekiq queues. As a result, we hit the memory limit of our Redis instance. This caused dropped jobs, since Sidekiq wasn’t able to enqueue more jobs.
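A guard against this kind of silent fill-up can be sketched as a small headroom check. This is a minimal Python sketch, not the check we run in production: the function names and the 20% default are hypothetical, and it assumes the monitoring side can obtain a dict with the `used_memory` and `maxmemory` fields that the Redis INFO command reports (some hosted Redis plans may not expose `maxmemory`).

```python
def redis_memory_headroom(info):
    """Return the fraction of the Redis memory limit still free.

    `info` is assumed to be a dict with integer `used_memory` and
    `maxmemory` fields in bytes, as reported by Redis INFO.
    Returns None when no limit is configured.
    """
    limit = info.get("maxmemory", 0)
    if limit <= 0:
        # maxmemory = 0 means Redis has no configured memory limit.
        return None
    used = info.get("used_memory", 0)
    # Clamp at zero so an over-limit instance reports "no headroom".
    return max(0.0, 1.0 - used / limit)


def should_warn(info, min_headroom=0.2):
    """True when less than `min_headroom` of the memory limit is left."""
    headroom = redis_memory_headroom(info)
    return headroom is not None and headroom < min_headroom
```

Warning while there is still headroom, rather than at the limit, leaves time to react before jobs can no longer be enqueued.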

After we noticed that our Redis instance was completely full, we started discarding some jobs so that Sidekiq could connect new workers. This didn’t resolve the issue: Sidekiq picked up some jobs but was still unable to process them, because they hung during database operations. The majority of the jobs in the queue were log updates, which are INSERT statements against our log table. Browsing the query list in Postgres revealed a lot of hanging INSERT statements. We had to terminate all hanging queries before Postgres would accept new ones. This resolved the issue.
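For illustration, finding and terminating hanging statements could look roughly like the following. This is a hedged sketch rather than the exact commands we ran: it assumes Postgres 9.2 or later (where `pg_stat_activity` has `pid`, `state`, `query` and `query_start` columns) and a DB-API cursor such as one from psycopg2. The function name, the five-minute cutoff and the restriction to INSERT statements are assumptions made for the example.

```python
# Active statements matching a pattern that have been running longer than
# a cutoff, excluding the session that runs this query itself.
FIND_HANGING_SQL = """
SELECT pid, query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND query ILIKE %s
  AND query_start < now() - %s * interval '1 second'
"""

TERMINATE_SQL = "SELECT pg_terminate_backend(%s)"


def terminate_hanging_inserts(cursor, max_age_seconds=300, dry_run=True):
    """Return the pids of long-running INSERTs; terminate them unless dry_run.

    `cursor` is any DB-API cursor using the %s parameter style.
    """
    cursor.execute(FIND_HANGING_SQL, ("INSERT%", max_age_seconds))
    pids = [row[0] for row in cursor.fetchall()]
    if not dry_run:
        for pid in pids:
            # pg_terminate_backend ends the whole backend session for that pid.
            cursor.execute(TERMINATE_SQL, (pid,))
    return pids
```

The `dry_run` default means the first call only lists candidates, so the list can be reviewed before anything is terminated.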

What caused the outage?

Several failures combined to make the outage as long as it was.

  1. Our monitoring/alerting failed. We use Librato to visualize key infrastructure metrics. Librato is also responsible for watching those metrics and alerting PagerDuty when key metrics exceed their thresholds. Due to a configuration error, we did not send all metrics to Librato and therefore didn’t receive PagerDuty alerts on our phones. As a result, we noticed the incident only 45 minutes after it began. About 40 minutes into the incident, NewRelic began to trigger PagerDuty as more key metrics started to exceed thresholds. After receiving the PagerDuty alerts from NewRelic, we immediately started taking action. We have added more alerts to Librato, which trigger PagerDuty if key metrics are missing. In addition, we adjusted the thresholds to include lower boundaries as well as upper ones, so alerts fire whenever a metric rises above or falls below its allowed range.

  2. Postgres couldn’t process INSERT/UPDATE statements. We pushed Heroku’s monitoring data into NewRelic and Librato, and nothing in that data looked odd. We are currently in contact with Heroku to get more data and figure out what happened under the hood.

  3. We misdiagnosed the problem during debugging. We initially thought Sidekiq was unable to connect to Redis, and only discovered the Postgres issue after we had resolved the memory issue. This human error made the outage longer.
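The alerting change described in the first point (alert on missing metrics, and on values outside a lower and an upper bound) can be sketched in Python as follows. This is an illustration of the idea only, not Librato’s or PagerDuty’s API: `MetricRule`, `check_metric` and the reason strings are hypothetical names, and timestamps are assumed to be plain seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricRule:
    name: str
    lower: float                # alert when the value falls below this
    upper: float                # alert when the value rises above this
    max_silence_seconds: float  # alert when no sample arrived for this long


def check_metric(rule: MetricRule,
                 last_value: Optional[float],
                 last_seen: Optional[float],
                 now: float) -> Optional[str]:
    """Return an alert reason, or None when the metric looks healthy.

    `last_value` and `last_seen` describe the most recent sample;
    both are None if no sample was ever received.
    """
    # A metric that stopped reporting is treated as an alert in its own
    # right, so a broken reporting pipeline cannot hide an incident.
    if last_seen is None or last_value is None:
        return "missing"
    if now - last_seen > rule.max_silence_seconds:
        return "missing"
    if last_value > rule.upper:
        return "above_upper"
    if last_value < rule.lower:
        return "below_lower"
    return None
```

Checking for silence first matters: a stale sample can sit comfortably inside its bounds while the system behind it is already down.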

Again, we are sorry for causing issues on your side.

Ben from the Codeship


  • Bad database performance (writes got stuck)
  • Log update queue filled up
  • Redis ran out of memory
  • We freed memory in Redis
  • We terminated all currently running database queries
  • Up and running again

Status Update Timeline

Resolved – Builds continue running fine. We’ll keep monitoring and write a post-mortem! Jul 31, 2014 – 9:15PM UTC

Monitoring – The builds are running fine again. We will keep monitoring. Jul 31, 2014 – 8:49PM UTC

Update – We’ve resolved our database issue. We are currently restarting our build infrastructure to resume work on the latest builds. Jul 31, 2014 – 8:40PM UTC

Identified – We’ve traced the issue to our database and are currently looking to fix the issue. Jul 31, 2014 – 8:21PM UTC

Update – We’re having memory issues with our queuing system and are working on a fix. Jul 31, 2014 – 7:46PM UTC

Investigating – We are currently seeing problems with builds not running on our test servers. We are investigating and will keep you up-to-date. Jul 31, 2014 – 7:39PM UTC


More Stories By Manuel Weiss

I am the cofounder of Codeship – a hosted Continuous Integration and Deployment platform for web applications. On the Codeship blog we love to write about Software Testing, Continuous Integration and Deployment. Also check out our weekly screencast series 'Testing Tuesday'!
