The Debug Files: Bookshelf and the Broken Transfer Encoding

What follows is the story of a bug we encountered during development
of what will become the Chef 11 API server. The story unfolds as we
began to integrate and test Bookshelf, a new component that handles
cookbook file storage. Bookshelf exposes a REST API that behaves just
enough like the AWS S3 API to allow the other parts of Chef that make
use of it to seamlessly switch between Bookshelf and S3.

We began integrating Bookshelf into the Chef Server once we had passing
integration tests written using Erlang’s common_test library. Our
end-to-end API tests passed, so we knew that the server and
the client driving the API tests were able to communicate with
Bookshelf. However, a trial chef-client run failed due to bad
cookbook file data. When we inspected the cookbook file content by
fetching it directly from Bookshelf, we immediately saw a problem. At
first glance, the contents appeared corrupted. There were some random
characters inserted before the correct content and a mysterious
trailing zero had been tacked on. Here’s an example of what we saw
(with most of the content removed for readability):

[sourcecode gutter="false"]
$ curl -isk 'https://open-source-chef.opscode.piab:443/bookshelf/XX/checksum-e87034?AWSAccessKeyId=63d77&Expires=1347080140&Signature=fgUl'

HTTP/1.1 200 OK
Server: nginx/1.0.12
Date: Sat, 08 Sep 2012 04:40:54 GMT
Content-Type: application/octet-stream
Transfer-Encoding: chunked
Connection: keep-alive
x-amz-request-id: g2gCZAATYm9va3NoZWxmQDEyNy4wLjAuMWgDYgAABUNiAAE1lmIAA6Vs
Transfer-Encoding: chunked
Last-Modified: Sat, 08 Sep 2012 04:31:26 GMT
ETag: "6HA0YLpidgIQ8VbK5v041Q=="

6b1
#
# Cookbook Name:: apt
# Recipe:: cacher-client
#
# Copyright 2011, 2012 Opscode, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");

[snip]

file "/etc/apt/apt.conf.d/01proxy" do
    action :delete
    only_if {File.exists?("/etc/apt/apt.conf.d/01proxy")}
  end
end

0
[/sourcecode]

Bookshelf stores its data using a simple binary file format consisting
of a two byte magic number, a 16 byte MD5 checksum, and the file
contents. Our first thought was that we had a bug in the low-level
Bookshelf code that was either not writing the binary format correctly
or misreading it to include extra data. This was quickly ruled out. We
confirmed that the on-disk content was correct.

After inspecting the response more closely, we realized that we were
looking at a transfer-encoding chunked HTTP response. The “random
characters” preceding the content where actually the length of a chunk
in hex. The zero at the end was the last-chunk marker. Bookshelf
supports streaming uploads and downloads (via chunked transfer
encoding) to bound memory use when serving large files. After
confirming that an HTTP 1.0 request using `curl -0` succeeded, we
began to suspect that we had a bug either in our streaming download
code or that there was a problem with the streaming body
implementation in [Webmachine][].

We turned our attention to the suspected Bookshelf Webmachine resource
and Webmachine’s body streaming API. Much reading and tracing of
code followed. Eventually we convinced ourselves we were using the
streaming API correctly and the streaming API itself was behaving
sanely. It was time to inspect a running Bookshelf server to try
and shed new light on our bug.

Enter [redbug][] and the delight of working with Erlang’s dynamic
tracing and debugging tools. Redbug is an easy to use wrapper
around Erlang’s live tracing tools (hat tip to [@slfritchie][] for
recommending redbug and suggesting we add it to our OTP releases for
convenient debugging goodness). We attached a remote shell to the
Bookshelf VM and used redbug to see exactly what was being put on the wire:

(bookshelf@127.0.0.1)5> redbug:start(60000, 20, {gen_tcp, send, [return]}).

This tells redbug to trace all calls to `gen_tcp:send` (of any arity)
and to display the call, its arguments, and the corresponding return
value. The trace will be in place for one minute (60000 ms) or up to
20 hits (these parameters help you from shooting yourself in the foot
and bringing down your VM by accidentally tracing something that is in
a hot path). Here’s the output that we found after repeating our curl
call to fetch the cookbook file:

[sourcecode gutter="false"]
05:04:24 <dead> {gen_tcp,send,
                     [#Port<0.1760>,
                      <<"HTTP/1.0 200 OK\r\nx-amz-request-id: g2gCZAATYm9va3NoZWxmQDEyNy4wLjAuMWgDYgAABUNiAAE7GGIACtrO\r\nTransfer-Encoding: chunked\r\nServer: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted it blue)\r\nLast-Modified: Sat, 08 Sep 2012 04:31:26 GMT\r\nETag: \"6HA0YLpidgIQ8VbK5v041Q==\"\r\nDate: Sat, 08 Sep 2012 05:04:24 GMT\r\nContent-Type: application/octet-stream\r\n\r\n">>]}
                            
05:04:24 <dead> {gen_tcp,send,2} -> ok
                            05:04:24 <dead> {gen_tcp,send,[#Port<0.1760>,<<"6b1">>]}
                            
05:04:24 <dead> {gen_tcp,send,2} -> ok
                            
05:04:24 <dead> {gen_tcp,send,[#Port<0.1760>,<<"\r\n">>]}
                            
05:04:24 <dead> {gen_tcp,send,2} -> ok
                            
05:04:24 <dead> {gen_tcp,send,
                             [#Port<0.1760>,
                              <<"#\n# Cookbook Name:: apt\n# Recipe:: cacher-client\n#\n# Copyright"...>>]}
    
05:04:24 <dead> {gen_tcp,send,2} -> ok
                            
05:04:24 <dead> {gen_tcp,send,[#Port<0.1760>,<<"\r\n">>]}
                            
05:04:24 <dead> {gen_tcp,send,2} -> ok
                            
05:04:24 <dead> {gen_tcp,send,[#Port<0.1760>,<<"0">>]}
                            
05:04:24 <dead> {gen_tcp,send,2} -> ok
                            
05:04:24 <dead> {gen_tcp,send,[#Port<0.1760>,<<"\r\n">>]}
                            
05:04:24 <dead> {gen_tcp,send,2} -> ok
                            
05:04:24 <dead> {gen_tcp,send,[#Port<0.1760>,<<>>]}
                            
05:04:24 <dead> {gen_tcp,send,2} -> ok
                            
05:04:24 <dead> {gen_tcp,send,[#Port<0.1760>,<<"\r\n">>]}
                            
05:04:24 <dead> {gen_tcp,send,2} -> ok
[/sourcecode]

Reading up on [chunked transfer coding][], we conclude that Bookshelf
and Webmachine are doing the right thing. The `Transfer-Encoding`
header is set correctly, the length of the first chunk is correctly
provided in hex as 6b1 followed by CRLF and the chunk content, and the
last-chunk marker is provided in the same form with zero length.

Comparing the headers put on the wire as seen in the above trace to
those we saw from curl, we noticed that the `Transfer-Encoding:
chunked` header appears twice in the output from curl. In the context
of Chef Server, we have nginx running as a reverse proxy in front of
Bookshelf and the erchef API server. Could nginx be adding an extra
Transfer-Encoding header and be the cause of the incorrect responses?
In short, yes. Google helped us find this [FAQ][] along with other
pages indicating that support for proxying HTTP 1.1 servers was added
to nginx in version 1.1.4. Upgrading our nginx from 1.0.12 to 1.2.3
resolved this problem for us.

Take aways from this small debugging adventure:

**Keep the entire system in mind**

Our initial debugging efforts perhaps focused too closely on the
application server. Had we considered the impact of the front-end load
balancer earlier, it might have taken less time to resolve the
problem.

Our tests failed to catch the issue because they focused too narrowly
on individual systems. We had coverage of Bookshelf in isolation, but
not behind a load balancer. Our API tests were written with a focus on
erchef and while we had verification of the erchef/bookshelf
communication, we were missing the test that would have caught this
earlier.

**Easy to use tracing tools are powerful.**

Being able to inspect what was being put on the wire allowed us to
rule out a class of problems and focus in on the root cause of the bug
more quickly.

**Know your protocols and read your RFCs.**

In this case, familiarity with the chunked transfer coding protocol
provided an important clue and helped us interpret the unexpected
output.

[FAQ]: http://nginx.org/en/docs/faq/chunked_encoding_from_backend.html
[Webmachine]: http://wiki.basho.com/Home.html
[redbug]: https://github.com/massemanet/eper
[@slfritchie]: https://twitter.com/slfritchie
[chunked transfer coding]: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6.1

Author Seth Falcon