Chef Client 12.14.60 – Escaped defects and corrective actions

Despite our normal testing and processes, the 12.14.60 release of Chef Client included a number of regressions and escaped defects (you may also call them “bugs”).  One source of these defects was the yum_repository resource, which was added to Chef Client in version 12.14.60; it previously shipped as part of the yum cookbook.  We will use the regressions around the yum_repository resource as a proxy for the release as a whole and will not dig into the specifics of the other regressions, though they will be captured in this incident report.

Allowing any defects to escape is problematic.  The defects that shipped in this release caused some of you significant pain.  I am sorry for the pain this caused you and your teams.

We held an internal post mortem to discuss this incident.  Read below for more information on the issue, its impact, and the corrective actions we are taking to reduce our time to detect and resolve these kinds of issues.

Impact

  • Failed chef-client runs for anyone using a yum_repository resource with a url parameter or a :delete action and chef-client version 12.14.60.

Time to Detect and Resolve

Time to detect and time to resolve are two important metrics that we track for incidents like this one.

  • Time to detect – 70 minutes
      • 18:19 – Chef Client released, 19:09 – GitHub issue 5317 opened.
  • Time to resolve – 6 days, 5 hours, 1 minute
    • 6 hours, 52 minutes
      • 14-Sep-2016 18:19 Chef Client 12.14.60 released, 15-Sep-2016 01:11 current build of chef-client released that includes the fixes.
    • 5 days, 5 hours, 27 minutes
      • 14-Sep-2016 18:19 Chef Client 12.14.60 released, 19-Sep-2016 23:46 Chef Client 12.14.77 released
    • 6 days, 5 hours, 1 minute
      • 14-Sep-2016 18:19 Chef Client 12.14.60 released, 20-Sep-2016 23:20 Doc site includes yum_repository resource
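
The three time-to-resolve figures above can be reproduced from the listed timestamps. The following is a minimal sketch, not part of our tooling; it assumes all timestamps are in the same time zone, and the milestone labels and the `breakdown` helper are names chosen here for illustration.

```python
from datetime import datetime, timedelta

# Release time of Chef Client 12.14.60, as listed above.
released = datetime(2016, 9, 14, 18, 19)

# Resolution milestones from the list above (labels are illustrative).
milestones = {
    "current build with fixes": datetime(2016, 9, 15, 1, 11),
    "Chef Client 12.14.77": datetime(2016, 9, 19, 23, 46),
    "doc site includes yum_repository": datetime(2016, 9, 20, 23, 20),
}


def breakdown(delta: timedelta) -> tuple:
    """Split a timedelta into whole (days, hours, minutes)."""
    hours, remainder = divmod(delta.seconds, 3600)
    return delta.days, hours, remainder // 60


# Elapsed time from release to each milestone.
results = {name: breakdown(when - released) for name, when in milestones.items()}

for name, (days, hours, minutes) in results.items():
    print(f"{name}: {days} days, {hours} hours, {minutes} minutes")
```

The overall time to resolve is the largest of the three values, since we counted the incident as resolved only once the documentation caught up with the released fix.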

Preventing Similar Incidents

The specific steps we are taking to improve our response to these incidents include:

  • Provide more timely announcements when we know that software we have shipped requires an immediate release to resolve escaped defects or regressions.
  • Automate and improve generation of documentation.
  • Add more tests when migrating providers from cookbooks into Chef Client.
  • Consider moving target release dates to earlier in the week, which would leave more working days to repair any reported issues and avoid delays over the weekend.

Complete Post Mortem

The complete post mortem, including the timeline, contributing factors, and more, is available in this GitHub Gist.

Nathen Harvey

As the VP of Community Development at Chef, Nathen helps the community whip up an awesome ecosystem built around the Chef framework. Nathen also spends much of his time helping people learn about the practices, processes, and technologies that support DevOps, Continuous Delivery, and Web-scale IT. Prior to joining Chef, Nathen spent a number of years managing operations and infrastructure for a number of web applications. Nathen is a co-host of the Food Fight Show, a podcast about Chef and DevOps.