Cookbook Dependency API Postmortem

On Tuesday, August 12th at 15:10 UTC, the cookbook dependency API provided by Supermarket became unusable. We are very sorry for this outage and the interruption to your workflow. In this post, I will explain what happened and the steps we are taking to prevent it from happening again.

Background

At Chef, we perform a postmortem for any significant production outage or incident that results in paging an engineer. We first publish a postmortem write-up, with the timeline, root cause, and corrective actions, in a private repository. Then we schedule an internal postmortem meeting, where the incident leader and the others who were involved discuss what happened, why it happened, and how to prevent it from happening again. This is a learning experience for everyone, and these meetings are conducted in a blameless manner. We also often publish a public blog post when the outage had external customer impact.

The write-up and meeting are normally kept internal. However, because Supermarket is a community tool, we wanted this process to be public. The application repository and supporting cookbook are open source, and issues that affect “supermarket.getchef.com” may affect anyone who runs the application on their own infrastructure. So for this incident, we conducted the postmortem meeting in the open, giving the community the chance to participate on equal footing with anyone at Chef.

  • Postmortem write-up – Contents from our internal repository where we store postmortem write-ups.
  • Postmortem meeting – Video of the postmortem meeting.

Contributing Factors

Version 2.7.1 of the supermarket cookbook broke the symlinking of .env. The breakage was not discovered at deploy time because:

  • The staging server uses a Redis instance in the default location; production does not. Staging therefore kept working even without the symlinked settings.
  • The symlink code runs only as part of a deploy_revision, and this was the first time a revision had been deployed to production since the cookbook update.
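
A post-deploy check along the following lines could catch a missing or dangling symlink even when the application still appears healthy, as it did on staging. This is a minimal sketch in Python, not code from the supermarket cookbook: the function name check_env_symlink and the release/shared directory layout are assumptions made for illustration.

```python
import os


def check_env_symlink(release_dir, shared_dir, name=".env.production"):
    """Return a list of problems with the release's env symlink (empty if fine).

    Assumed layout (hypothetical): release_dir/<name> should be a symlink
    pointing at shared_dir/<name>.
    """
    link = os.path.join(release_dir, name)
    target = os.path.join(shared_dir, name)
    problems = []

    if not os.path.islink(link):
        # Covers both "missing entirely" and "a regular file, not a link".
        problems.append(f"{link} is not a symlink")
    elif os.path.realpath(link) != os.path.realpath(target):
        problems.append(f"{link} does not point at {target}")
    elif not os.path.exists(link):
        # The link points at the right path, but nothing exists there.
        problems.append(f"{link} is a dangling symlink")

    return problems
```

A deploy script could call this after each deploy and fail loudly when the returned list is not empty, instead of relying on the application to misbehave visibly.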

Stabilization Steps

The Chef Operations team manually added a symlink for .env.production and restarted the web server so that the Supermarket application would know where to find the Redis instance.

Corrective Actions

A number of corrective actions were agreed upon in the meeting. The list below gives each action, who it was assigned to, and whether it has already been completed.

  • COMPLETED – Correct the way the cookbook handles the .env.production symlink. (Released as part of version 2.7.2 of the supermarket cookbook)
  • Make sure deploys do not generate false alerts. (FullStack – https://trello.com/c/ddg4h6NL)
  • COMPLETED – Create a playbook for using GitHub directly in the case that we need to update the cookbook and the cookbook dependency API is down. (Chef Operations)
  • Make Supermarket throw an exception if Redis is inaccessible. (FullStack – https://trello.com/c/bDXIjdVT)
  • Make /status return a “not ok” status if a service is unreachable. (FullStack – https://trello.com/c/Ho4fHdXy)
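
The last two items can be illustrated together. The sketch below uses only the Python standard library and is not Supermarket's actual code: the names ServiceUnreachable, redis_reachable, require_redis, and status_payload are hypothetical, as is the shape of the returned payload. It shows one way to fail fast when Redis cannot be reached and to report “not ok” when any service check fails.

```python
import socket


class ServiceUnreachable(RuntimeError):
    """Raised when a required backing service cannot be contacted."""


def redis_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves that something is listening on the port; a fuller
    check would also send a Redis PING and wait for the reply.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def require_redis(host, port):
    """Raise instead of continuing silently when Redis is unreachable."""
    if not redis_reachable(host, port):
        raise ServiceUnreachable(f"Redis is not reachable at {host}:{port}")


def status_payload(checks):
    """Build a status payload from a mapping of service name to check function.

    Each check is a zero-argument callable returning True when the service
    is reachable. The overall status is "ok" only if every check passes.
    """
    services = {
        name: ("reachable" if check() else "unreachable")
        for name, check in checks.items()
    }
    all_up = all(state == "reachable" for state in services.values())
    return {"status": "ok" if all_up else "not ok", "services": services}
```

Passing the checks in as callables keeps the status logic independent of any particular service, so further dependencies could be added to the same endpoint without changing it.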

Conclusion

We are sorry that the cookbook dependency API was unavailable. We know that you depend on us to keep this service up and running. We will work hard to prevent this sort of issue in the future.

Thank you!