Supermarket HTTPS Redirect Postmortem

Ohai Chefs!

The new community site, Supermarket, was soft-launched in "beta" on Tuesday, June 17. When it was launched, we weren't enforcing HTTPS/SSL for the site. Yesterday, we deployed a change to enforce redirection from HTTP to HTTPS at the application level, which wound up loading a default Nginx page. This meant that the Supermarket was closed! We're sorry about that. Even though the site isn't considered production ("beta"!), we took this outage as seriously as any other.

In this post, I'll explain the background on what happened, how we run the production supermarket application, and what we're going to do to make sure this doesn't happen in the future.

Background

At Chef, we perform a postmortem for any significant production environment outage or incident that results in an engineer being paged. We have a postmortem writeup, with the timeline, root cause, and corrective actions published in a private repository. Then, we schedule an internal postmortem meeting, where the incident leader for the problem discusses with others who were involved what happened, why, and how to prevent the same thing from happening. This is a learning experience for everyone, and these meetings are conducted in a blameless manner. We also often make a public blog post when the outage had external customer impact.

The writeup and meeting are normally internal. However, Supermarket is the community site, and it serves the community. The application repository and supporting cookbook are open source, and issues that affect "supermarket.getchef.com" may also affect anyone who runs the application on their own infrastructure. So for this incident, we conducted the postmortem meeting in the open, giving the community the same chance to participate as anyone at Chef.

As stated, when we turned on the HTTP -> HTTPS redirect in the nginx proxy in front of the supermarket app, a default nginx welcome page was displayed instead of the site, meaning supermarket was down. What happened?

Root Cause

The supermarket application runs in an AWS account on three instances, with an RDS database, an ElastiCache Redis cache, and an Elastic Load Balancer (ELB) in front of the instances. Supermarket is a Rails app run under the unicorn HTTP server, which listens on a unix domain socket on each of the instances. Nginx is used as a local reverse proxy on each of the application servers. Prior to the change that was deployed, nginx only listened on port 80, and the ELB's port 443 listener forwarded to port 80 on the instances. In order to enforce HTTPS, nginx had to be configured to redirect plain HTTP requests to HTTPS (port 443). For background, see Frank Mitchell’s blog post.

Prior to the change, users could connect to supermarket via https directly, using https://supermarket.getchef.com, but they weren't forced to use it if they came to the site via http://supermarket.getchef.com. The point of the change was to ensure that HTTP is redirected to HTTPS. Chef Operations had a detailed deployment plan ready, and performed the following:

  • ensure oauth callback URLs were updated to https
  • upload the supermarket cookbook to the Chef Server (we use Hosted Chef)
  • run chef-client on the instances

The nginx site configuration was then updated with the following server blocks:

server {
  listen 80;
  # several proxy settings...
  location / {
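    # X-Forwarded-Proto is set by the ELB to the protocol the client used;
    # redirect anything that did not arrive over HTTPS.
    if ($http_x_forwarded_proto != 'https') {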
      return 301 https://$server_name$request_uri;
    }
  }
}

server {
  listen 443;
  # other settings...
}

The root cause of the issue is that the ELB's HTTPS listener still pointed at port 80 on the instances instead of port 443. HTTPS requests therefore reached the new port 80 server block rather than the port 443 one, and the proper redirection wasn't happening.

However, it still isn't clear why the default Nginx welcome page was displayed, as it isn't present in any of the configuration.
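To make the redirect condition concrete, here is a minimal Python sketch that models only the `if` test in the port 80 server block above. It assumes the ELB sets X-Forwarded-Proto to the protocol the client used; the function name and the example path are hypothetical. It deliberately says nothing about what nginx served once the condition was skipped, since that part is still unexplained.

```python
def port_80_response(x_forwarded_proto, server_name, request_uri):
    """Model the `if` test in the port 80 server block (hypothetical helper).

    Returns (301, location) when the redirect fires, or None when the request
    falls through to whatever else `location /` does (not modelled here).
    """
    if x_forwarded_proto != "https":
        return (301, "https://" + server_name + request_uri)
    return None


# Client used plain HTTP: the ELB forwards with X-Forwarded-Proto "http",
# so the redirect fires.
http_result = port_80_response("http", "supermarket.getchef.com", "/cookbooks")

# Client used HTTPS, but the ELB listener still targets instance port 80:
# the header is "https", so the redirect is skipped in this server block.
https_result = port_80_response("https", "supermarket.getchef.com", "/cookbooks")

print(http_result)
print(https_result)
```

In this model, an HTTPS request that lands on port 80 never gets the 301 and never reaches the port 443 server block.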

Stabilization Steps

Per the timeline, Chef Operations staff spent about 16 minutes discussing the change and how the ELB should be working, and investigating the configuration. The same configuration was tested in the staging environment, and from an application perspective, everything appeared okay. However, upon comparison of the AWS ELB between staging and production, it was discovered that the production HTTPS listener wasn't forwarding to port 443 on the instances.

The ELB configuration was updated to use port 443 on the instances, and the issue was immediately resolved – http://supermarket.getchef.com redirected properly to https://supermarket.getchef.com, and the application was loading the site content.
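The staging-versus-production comparison described above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the function name, and the port 80 listener entries are assumptions, and the only values taken from this incident are that production's 443 listener pointed at instance port 80 while staging's pointed at 443.

```python
def listener_differences(staging, production):
    """Return {lb_port: (staging_instance_port, production_instance_port)}
    for every load balancer port whose target differs between environments."""
    staging_map = {l["lb_port"]: l["instance_port"] for l in staging}
    production_map = {l["lb_port"]: l["instance_port"] for l in production}
    diffs = {}
    for lb_port in sorted(set(staging_map) | set(production_map)):
        s = staging_map.get(lb_port)      # None if the listener is missing
        p = production_map.get(lb_port)
        if s != p:
            diffs[lb_port] = (s, p)
    return diffs


# Hypothetical listener data; the port 80 entries are assumed.
staging_listeners = [
    {"lb_port": 80, "instance_port": 80},
    {"lb_port": 443, "instance_port": 443},
]
production_listeners = [
    {"lb_port": 80, "instance_port": 80},
    {"lb_port": 443, "instance_port": 80},  # the misconfigured listener
]

differences = listener_differences(staging_listeners, production_listeners)
print(differences)
```

A check like this is the kind of comparison that becomes routine once both environments are described by the same automation.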

Corrective Action

We have three corrective actions to take as an outcome of the postmortem meeting.

  1. The AWS infrastructure will be managed with Chef, so each environment is automated, and can be compared easily. Currently, the RDS, ElastiCache, and ELB resources are manually managed. We'd like to use Chef's aws cookbook to do this.
  2. The staging configuration will be the same as production. An ELB was created for the staging site to test that the redirect worked at the ELB level. It will be managed with the same automation as production.
  3. The external nagios check will follow 301, and verify it gets a valid HTTPS/SSL response.

Currently there are some small differences in staging: prior to this change it had no ELB, and it still runs postgresql and redis on the node itself, whereas production uses RDS (postgresql) and ElastiCache (redis). Chef's operations and community teams will work on these actions.

The external nagios check is currently only configured to check that HTTP 200 is returned from http://supermarket.getchef.com. With the redirect in place, a 301 is the expected response, and it should point to HTTPS. This monitoring only notified our oncall rotation because the HTTP return code was different. We're working on deploying a new monitoring framework based on Sensu, and we'll use Supermarket's built-in health endpoint for that.
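As a rough sketch of the pass/fail rule we want from the updated check, the Python below takes the responses a probe would have observed and decides whether they are healthy. The function name and its arguments are hypothetical, this is not the actual nagios plugin, and a real check would fetch the URLs itself.

```python
from urllib.parse import urlsplit


def redirect_check_ok(first_status, location, final_status,
                      expected_host="supermarket.getchef.com"):
    """Decide whether an HTTP probe saw a healthy redirect (hypothetical rule).

    first_status: status code returned for the plain http:// URL
    location:     value of the Location header on that response, or None
    final_status: status code returned after following the redirect
    """
    if first_status != 301:
        return False                      # plain HTTP must redirect
    target = urlsplit(location or "")
    if target.scheme != "https":
        return False                      # redirect must go to HTTPS
    if target.hostname != expected_host:
        return False                      # and stay on the expected host
    return final_status == 200            # the HTTPS page must load


healthy = redirect_check_ok(301, "https://supermarket.getchef.com/", 200)
no_redirect = redirect_check_ok(200, None, 200)
wrong_scheme = redirect_check_ok(301, "http://supermarket.getchef.com/", 200)
print(healthy, no_redirect, wrong_scheme)
```

Keeping the rule separate from the fetching would let the same logic be reused when the check moves to the new monitoring framework.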

Conclusion

We're sorry about this issue on the fledgling Supermarket site. During this beta period we plan to get issues like this ironed out. You can follow development of the application itself at https://github.com/opscode/supermarket, and of the cookbook that automates Supermarket's infrastructure at https://github.com/opscode-cookbooks/supermarket. If you experience any problems with the application, report an issue.

Thank you!

Joshua Timberman

Joshua Timberman is a Code Cleric at CHEF, where he Cures Technical Debt Wounds for 1d8+5 lines of code, casts Protection from Yaks, and otherwise helps continuously improve internal technical process.