Berkshelf v2 Outage Postmortem

The new Supermarket site became the official community site on Monday, July 7th, 2014 at 12:15 PM PDT. Shortly after the cutover we were made aware that the change broke compatibility with Berkshelf v2.x. This interrupted people’s ability to get work done and we are sorry. Even though Berkshelf v2.x is considered deprecated by its authors, we took this outage as seriously as any other.

In this post, I’ll explain the background on what happened and what we’re going to do to make sure this doesn’t happen in the future.

Background

At Chef, we perform a postmortem for any significant production environment outage or incident that results in an engineer being paged. We write up the timeline, root cause, and corrective actions, and publish the writeup in a private repository. Then we schedule an internal postmortem meeting, where the incident leader and the others who were involved discuss what happened, why it happened, and how to prevent it from happening again. This is a learning experience for everyone, and these meetings are conducted in a blameless manner. We also often publish a public blog post when the outage had external customer impact.

The writeup and meeting are normally done internally. However, Supermarket is the community site, and it serves the community. The application repository and supporting cookbook are open source, and issues that affect “supermarket.getchef.com” may affect anyone who also runs the application on their internal infrastructure. So for this incident, we conducted the postmortem meeting in the open, so the community had a chance to participate on equal footing with anyone at Chef.

Berkshelf v2.x was unable to connect to cookbooks.opscode.com and reported errors when trying to do so. Additionally, Berkshelf v2.x was unable to fetch cookbooks.

Contributing Factors

Supermarket attempts to enforce security best practices by default. One way this is achieved is by redirecting all HTTP traffic to HTTPS. A bug in the open-uri library causes an exception to be thrown when a redirect includes a change in the protocol. Berkshelf v2.x uses the open-uri library, and this bug caused Berkshelf to crash when it attempted to follow the redirect.
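The failure mode can be illustrated with a small Ruby sketch. This is a simplified model of a redirect policy that refuses any change of protocol, not the actual open-uri source. The names `RedirectForbidden`, `same_scheme?`, and `follow_redirect` are hypothetical, and the URLs are only illustrative.

```ruby
require "uri"

# Hypothetical error type for this sketch; it is not part of open-uri.
class RedirectForbidden < StandardError; end

# Simplified policy: a redirect is acceptable only when the scheme
# (protocol) of the target matches the scheme of the original request.
def same_scheme?(from, to)
  from.scheme.casecmp(to.scheme).zero?
end

# Returns the redirect target as a URI, or raises when the redirect
# would change the protocol (for example, HTTP to HTTPS).
def follow_redirect(from_url, to_url)
  from = URI.parse(from_url)
  to = URI.parse(to_url)
  unless same_scheme?(from, to)
    raise RedirectForbidden, "redirection forbidden: #{from} -> #{to}"
  end
  to
end

# An HTTP request that is redirected to HTTPS trips the policy.
begin
  follow_redirect("http://cookbooks.opscode.com/", "https://cookbooks.opscode.com/")
rescue RedirectForbidden => e
  puts e.message
end
```

Under this model, a client that starts on HTTP cannot follow a redirect to HTTPS, which is why either serving the content over plain HTTP or pointing the client directly at an HTTPS URL avoids the error.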

Stabilization Steps

  1. Update cookbooks.opscode.com to allow traffic over HTTP, thus avoiding the redirect.
  2. Recommend Berkshelf v2.x users update their Berksfiles to use https://supermarket.getchef.com/api/v1 as the source.
  3. Patch and release Berkshelf v2.0.18 to resolve the issue.

Corrective Actions

  1. Update the README for the Supermarket cookbook to make note that Berkshelf v2.x support should be tested when making changes. (Christopher Webber)
  2. Ensure that #chef on irc.freenode.net gets updates from http://status.getchef.com. (Christopher Webber)

Conclusion

We are sorry that we broke compatibility with Berkshelf v2.x without notice. We know that you depend on us to give proper notice for breaking changes. We will work hard to prevent this sort of issue in the future.

Thank you!

Christopher Webber