Chef Supermarket Outage Post Mortem

On Thursday, February 26, an outage prevented users from downloading cookbooks from Supermarket via Berkshelf. The next day, February 27, we held a public post mortem.

If you’d like to see the video of the post mortem, you can view it on YouTube here.

Description

A deploy to production Supermarket, intended to allow http access for downloads, switched the download links served to Berkshelf and ChefDK to http but did not fix the broken http downloads. As a result, Berkshelf and ChefDK cookbook downloads failed.

Timeline

Time to Detect – 47 minutes
Time to Resolution – 103 minutes

All times are in UTC on February 25, 2016

  • 20:55: Deploy of supermarket 2.4.0, which causes the issue, is performed by Robb Kidd (robb); at this time https is still functional
  • 21:31: First user report of issue comes in via #chef on Freenode (irc)
  • 21:35: Issue is reported in Hangops Slack
  • 21:37: Noah Kantrowitz (coderanger) notifies Paul Mooring (pwm) via Chef Success Slack
  • 21:42: Nell Shamrell-Harrington (nell), pwm and robb begin investigating the issue in Chef’s internal Slack
  • 21:46: Incorrect protocol in universe endpoint is discovered by robb
  • 21:53: Config option to disable ssl is pointed out by robb
  • 21:55: The ssl config option is set to true by nell
  • 22:03: All nodes have ssl set to true
  • 22:03: Due to a self-signed cert, all download URLs are unreachable
  • 22:04: All instances get removed from service by ELB (due to cert issues)
  • 22:05: Eric Alwais (eric) updates Chef status page (status.chef.io)
  • 22:10: pwm, robb and nell meet to discuss problem
  • 22:22: robb begins reverting and pinning package version to 2.3.3
  • 22:28: nell directs robb to revert the config changes
  • 22:34: Changes complete, nell verifies problem is clear
  • 22:37: Josh Glass posts all clear to status page
  • 22:37: pwm calls incident resolved

Impact

Users were unable to download cookbooks using Berkshelf or ChefDK for approximately 2 hours.

  • Direct downloads (via web interface, curl, etc.) were functional using https
  • Automated systems (berkshelf, chefdk, etc.) were using the http links returned by the universe endpoint, and those downloads failed
  • After the ssl setting was enabled, a total outage occurred (30 minutes)

Contributing Factor(s)

  • Insufficient monitoring on supermarket (api including /universe and web app)
  • Lack of comprehensive testing on deploys
  • Overly complicated code in omnibus package
  • Lack of production system understanding

Stabilization Step

Changes made in the initial deploy were reverted:

  • Production supermarket was dropped back to version 2.3.3
  • Supermarket version 2.3.3 was locked on frontends
  • Config changes were reverted to the pre-deploy state and supermarket-ctl reconfigure was run
  • Unsecured (http over port 80) access to cookbook downloads was turned back off (backed out code change)

Corrective Actions

Long Term

  • Document various ssl deployments for supermarket
  • Get Supermarket deployed through automatic provisioning with tests

Immediate

  • Package a 2.4.1 without code changes for http downloads – robb
  • Add an attribute for supermarket version to deploy cookbook – nell
  • Monitor /universe including protocol version returned – nell and pwm
  • Update deployment checklist for explicit test steps – robb
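
The /universe monitoring item above could be approached along these lines. This is only a minimal sketch, not the check Chef actually deployed. It assumes the /universe response is a JSON object mapping cookbook name to version to a details object with a download_url field; the function name, the sample payload and the supermarket.example.com host are hypothetical and exist only for illustration.

```python
from urllib.parse import urlparse


def insecure_download_urls(universe, expected_scheme="https"):
    """Return (cookbook, version, url) for every entry whose download_url
    does not use the expected scheme.

    `universe` is assumed to be the parsed /universe JSON:
    {cookbook: {version: {"download_url": ...}}}.
    """
    offenders = []
    for cookbook, versions in universe.items():
        for version, details in versions.items():
            url = details.get("download_url", "")
            if urlparse(url).scheme != expected_scheme:
                offenders.append((cookbook, version, url))
    return offenders


# Hypothetical sample payload: one https link and one http link.
sample_universe = {
    "apt": {
        "2.9.2": {
            "download_url": "https://supermarket.example.com/api/v1/cookbooks/apt/versions/2.9.2/download"
        },
    },
    "nginx": {
        "2.7.6": {
            "download_url": "http://supermarket.example.com/api/v1/cookbooks/nginx/versions/2.7.6/download"
        },
    },
}

offenders = insecure_download_urls(sample_universe)
for cookbook, version, url in offenders:
    print(f"{cookbook} {version}: unexpected scheme in {url}")
```

In a real monitor the payload would be fetched from the live /universe endpoint and a non-empty result would raise an alert. A fuller check would also request one of the returned download URLs to confirm it is reachable, since the scheme alone does not show that downloads succeed.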
Nell Shamrell-Harrington

Nell Shamrell-Harrington is a Principal Software Development Engineer and Community Engineering Lead at Chef. She is also a member of the Habitat core team. She specializes in Open Source, Chef, Ruby, Rails, Rust, Regular Expressions, and Test Driven Development and has traveled the world speaking on these topics. Prior to entering the world of software development, she studied and worked in the field of Theatre. The world of Theatre prepared her well for the dynamic world of creating software applications. In both, she strives to create a cohesive and extraordinary experience.