Supermarket Berkshelf Incident Post Mortem

We at Chef believe it is important to conduct public post mortems whenever possible. We recently conducted one around a Supermarket/Berkshelf incident that occurred on May 16, 2016. I was the incident commander for this incident and would like to share both the video and write up.

Video Recording

Write Up

Description

On May 16 we experienced a brief SSL issue between Supermarket and Berkshelf.

Timeline

This incident began at 21:56UTC on Monday, May 16, 2016. It was resolved at 22:49UTC that same day.

**Time to detect**: 13 minutes 21:56UTC - 22:09UTC on Monday, May 16, 2016
**Time to resolve**: 40 minutes 21:56UTC - 22:36UTC on Monday, May 16, 2016
All times UTC
21:56  -   Nell Shamrell-Harrington upgraded 2 of the 4 Supermarket Prod nodes from Supermarket 2.5.2 to Supermarket 2.6.0.  She also upgraded the cookbook versions of oc-omnibus-supermarket and supermarket-omnibus-cookbook
22:09  -   Nell Shamrell-Harrington ran berks install to pull cookbooks from the public Supermarket and received this error:
           OpenSSL::SSL::SSLError: hostname "community-files.opscode.com.s3.amazonaws.com" does not match the server certificate
          She asked in the internal Chef Slack if someone else would run berks install to confirm what she was seeing
22:24  -  Lamont Grandquist confirmed that he was seeing the same error in Travis builds
22:32  -  Nell Shamrell-Harrington declared an incident
22:36  -  Nell Shamrell-Harrington moved the two upgraded Supermarket prod nodes out of the Supermarket prod ELB and confirmed that she no longer saw the error when running berks install
22:38  -  SaintAardvark in the #chef IRC channel reported SSL issues when running berks install. Noah Kantrowitz mentioned that kisoku (#chef IRC handle) was reporting the same thing
22:39  -  Noah Kantrowitz DM'd Nell Shamrell-Harrington to let her know that users in the Chef IRC channel were reporting issues with berks and Supermarket
22:43  -  Lamont Grandquist reported that Travis runs were working again
22:46  -  Nell Shamrell-Harrington entered #chef IRC
22:47  -  kisoku reported that his CI jobs were working again in #chef IRC
22:49  -  Nell Shamrell-Harrington declared the incident closed
22:50  -  SaintAardvark reported that his Jenkins jobs were working again in #chef IRC

Contributing Factor(s)

The 2.6.0 release of Supermarket included a commit which changed the AWS S3 URLs used to access cookbook artifacts in S3 storage. Prior to this change, Supermarket (through the Paperclip plugin) used a path-style S3 URL, which carries the bucket name in the URL path. The one for public Supermarket looked like this: https://s3.amazonaws.com/community-files.opscode.com/

The problem was that this URL style only worked if the S3 bucket was in N. Virginia. To fix this, we changed our config to use a virtual-hosted-style URL, which carries the bucket name in the hostname, like this: https://community-files.opscode.com.s3.amazonaws.com

When this change was merged and deployed, this error appeared when someone attempted to do a berks install using public Supermarket as the cookbook source:

OpenSSL::SSL::SSLError: hostname "community-files.opscode.com.s3.amazonaws.com" does not match the server certificate

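This was due to there being “.” characters in the bucket name “community-files.opscode.com”. The previous S3 URL style kept the bucket name in the URL path, where dots are harmless. The new style puts the bucket name in the hostname, and the wildcard in the S3 certificate (*.s3.amazonaws.com) covers only a single label to the left of s3.amazonaws.com. A hostname like community-files.opscode.com.s3.amazonaws.com has three extra labels, so certificate verification fails.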
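
To make the failure easier to see, here is a minimal Python sketch of the two URL shapes and a simplified wildcard check. It is not Supermarket or Paperclip code: the function names are hypothetical, the certificate name `*.s3.amazonaws.com` is taken as an assumption, the second bucket name is made up for contrast, and real TLS hostname verification involves more rules than this.

```python
from urllib.parse import urlsplit


def s3_url(bucket: str, bucket_in_host: bool) -> str:
    """Build an HTTPS URL for a bucket on the global S3 endpoint (illustrative only)."""
    if bucket_in_host:
        # Bucket name becomes part of the hostname.
        return f"https://{bucket}.s3.amazonaws.com/"
    # Bucket name stays in the URL path.
    return f"https://s3.amazonaws.com/{bucket}/"


def matches_wildcard(hostname: str, pattern: str) -> bool:
    """Simplified certificate name check: a leading '*' stands for exactly one DNS label."""
    host_labels = hostname.lower().split(".")
    pattern_labels = pattern.lower().split(".")
    if pattern_labels[0] == "*":
        return (
            len(host_labels) == len(pattern_labels)
            and host_labels[0] != ""
            and host_labels[1:] == pattern_labels[1:]
        )
    return host_labels == pattern_labels


WILDCARD = "*.s3.amazonaws.com"  # assumed name on the S3 certificate

# "community-files" is a hypothetical dot-free bucket name used only for contrast.
for bucket in ("community-files.opscode.com", "community-files"):
    host = urlsplit(s3_url(bucket, bucket_in_host=True)).hostname
    print(host, "matches" if matches_wildcard(host, WILDCARD) else "does not match")
```

With the bucket name in the path, the hostname is always s3.amazonaws.com, so the dots in the bucket name never reach the certificate check.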

Stabilization Steps

Fortunately, we had upgraded only 2 of the 4 prod nodes, so we removed the 2 upgraded nodes from the ELB, then downgraded them back to Supermarket 2.5.2.

Impact

For approximately 53 minutes, anyone using berks install saw the SSL error.

Corrective Actions

  • Make S3 url style configurable in Supermarket
  • Make sure the staging bucket has a name formatted similarly to the production bucket's name
  • Ensure that berks install is part of smoke tests in both staging and production
  • Add documentation around considerations when naming an S3 bucket
  • Investigate adding a monitor that does a simple berks install and executes on a regular basis
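
As a rough illustration of the last item, the following Python sketch runs a command on a schedule and raises an alert when it fails. It is only a sketch under assumptions, not the monitor we built: the function names, the interval, and the alert callback are hypothetical, and it assumes `berks` is on the PATH and that the working directory holds a Berksfile pointing at the public Supermarket.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence, Tuple


def check_once(command: Sequence[str], timeout_seconds: float = 300.0) -> Tuple[bool, str]:
    """Run the command once and return (succeeded, combined stdout and stderr)."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except (OSError, subprocess.TimeoutExpired) as error:
        # A missing binary or a hung run counts as a failure too.
        return False, str(error)
    return result.returncode == 0, result.stdout + result.stderr


def monitor(
    command: Sequence[str],
    alert: Callable[[str], None],
    interval_seconds: float = 600.0,
    max_runs: Optional[int] = None,
) -> None:
    """Run the check repeatedly, calling alert(output) on every failed run."""
    runs = 0
    while max_runs is None or runs < max_runs:
        succeeded, output = check_once(command)
        if not succeeded:
            alert(output)
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)


# Hypothetical usage (left commented out so importing this file runs nothing):
# monitor(["berks", "install"], alert=print)
```

In practice the alert callback would page someone or post to a chat channel rather than print, and the same check could run once as a smoke test after each deploy.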

Nell Shamrell-Harrington