Sometimes grandchildren are pesky

When you are building a configuration management system, one of the things you learn quickly is that you are going to hit a ton of edge cases. Many of these edge cases are the result of applications that were not written with automation in mind. As but one example, many applications do not change their exit code when they have an error – making automation very challenging.

We hit one particularly pernicious edge case recently while developing the Chef service provider for Red Hat. Whenever we called out to /sbin/service to start, stop, or restart a service, Chef would block forever waiting to read the output of the command. The bug is not unique to Red Hat – indeed, we saw a similar bug that we fixed in a much less elegant way with Ubuntu’s CouchDB package. This is that bug’s story.

The setup

We were using a CentOS 5.2 system to develop the provider. For this particular issue, we used /etc/init.d/gpm restart as our test case – but it appeared to happen with every init script we tried.

Here is a simple test case in Ruby:

#!/usr/bin/ruby
output = IO.popen("/sbin/service gpm restart")
puts "Reading output..."
puts output.read
puts "Would be nice to get here, but never going to happen."

Run that code snippet on any CentOS 5.2 desktop (which will have gpm installed) and you’ll notice that it blocks forever.

When you examine the process table, you’ll notice that the process we spawned (/sbin/service) has gone Zombie:

root     12586  Z+   02:29   0:00 [service] <defunct>

For those of you unfamiliar with Zombies: when a process is spawned, its parent is expected to clean up after it.

That is one messy zombie

Typically, this is done by issuing one of the variations of the wait system call, either directly or from a SIGCHLD handler. If the child has exited but the parent has not yet done its duty in cleaning up after the child, the process is marked as a Zombie. If the parent exits, the child is inherited by init, which reaps it promptly.

What’s happening?

When a child is forked, it inherits the file descriptors of its parent – in particular STDIN, STDOUT, and STDERR. You can see this behavior with this test script:

#!/usr/bin/ruby
STDOUT.puts "I am the parent @ #{Process.pid}"
STDERR.puts "I am the parent's stderr"
cid = fork do
  STDOUT.puts "I am the child @ #{Process.pid}"
  STDERR.puts "I am the child's stderr"
  exit 42
end
cid, status = Process.waitpid2(cid)
puts "Child #{cid} had exit status #{status.exitstatus}"

All we are doing here is the simplest possible fork – we spawn a new process, and both processes print to STDOUT and STDERR. The output reaches the terminal because the child is using the same file descriptors as the parent. Because we are good Unix citizens, we call waitpid2 in the parent – we care about our children!

When dealing with pipes (or file descriptors), it’s important to remember one thing:

The read end of a pipe will never return end-of-file while a write end is still open

In the case of our errant init script, the issue comes down to this: one of our children has spawned a grandchild, that grandchild has inherited one or both of our file descriptors, and neither process cleans up properly – they fail to close their ends of the pipe.

This leaves the child a Zombie: the grandchild has neither closed its end of the pipe nor exited, so the parent blocks forever on read and never gets around to reaping the child.

What do we do about it?

Ideally, everyone would be a good Unix citizen and clean up after themselves when they spawn children – in practice, they often aren’t. One way around this issue is to send our output to a temporary file rather than read directly from a pipe. The child won’t know the difference, and since reading a file is not the same as holding the read end of a pipe open, we can read the temp file at our leisure, knowing that once our child is dead, its output is in the file.

#!/usr/bin/ruby
require 'tempfile'

pin = IO.pipe
outfile = Tempfile.new("chef-exec")
errfile = Tempfile.new("chef-exec")

cid = fork
if cid
  pin.last.close
  outfile.close
  errfile.close

  cid, status = Process.waitpid2(cid)
  puts "Child #{cid} has exit status #{status.exitstatus}"

  puts IO.read(outfile.path)
  puts IO.read(errfile.path)

  # Tempfile cleans up automatically when the objects 
  # go out of scope
else
  pin.last.close 
  STDIN.reopen pin.first
  pin.first.close

  STDOUT.reopen outfile
  outfile.close

  STDERR.reopen errfile
  errfile.close

  exec("/sbin/service gpm restart")
end

While this is straightforward, it is a hack. For every process we spawn, we create a pair of temp files (one for STDOUT and one for STDERR)… and, if we need the output, we have to open, read, and close those temp files, then clean up after ourselves. That’s a lot of work for such simple functionality.

A better answer is to flip the order in which we do things just a little bit, and sprinkle a bit of O_NONBLOCK on our pipes. Avoiding the use of IO.popen this time (because we need a bit more control than it provides), here is an example that will not block, and still return all the output:

#!/usr/bin/ruby

require 'fcntl'
require 'io/wait'

pin, pout, perr = IO.pipe, IO.pipe, IO.pipe

cid = fork
if cid 
  [pin.first, pout.last, perr.last].each{ |fd| fd.close }

  pin.last.close

  pout.first.fcntl(Fcntl::F_SETFL, pout.first.fcntl(Fcntl::F_GETFL) | Fcntl::O_NONBLOCK)
  perr.first.fcntl(Fcntl::F_SETFL, perr.first.fcntl(Fcntl::F_GETFL) | Fcntl::O_NONBLOCK)

  cid, status = Process.waitpid2(cid)
  puts "Child #{cid} has exit status #{status.exitstatus}"

  puts pout.first.read if pout.first.ready?
  puts perr.first.read if perr.first.ready?  
else 
  pin.last.close 
  STDIN.reopen pin.first
  pin.first.close

  pout.first.close 
  STDOUT.reopen pout.last
  pout.last.close

  perr.first.close 
  STDERR.reopen perr.last
  perr.last.close

  exec("/sbin/service gpm restart")
end

That code can seem a little magical, so here it is step by step:

  • We create new pipes for our child’s STDIN, STDOUT, and STDERR.
  • We fork our child, and in the child we re-open those file descriptors on the child’s ends of the pipes.
  • We exec our program.
  • In the parent, we close our copies of the descriptors we gave our child.
  • We set the STDOUT and STDERR pipes to be non-blocking.
  • We wait for our child to exit, implying (through its death) that it has nothing more to say to us on either pipe.
  • We read from the two pipes, assuming they have data for us.

That’s quite a bit more code than just calling IO.popen! But it does not ever cause your application to block just to read the STDOUT and STDERR of another program, regardless of how badly behaved its children might be. It also avoids the significant overhead of streaming the output to temporary files just to open, read, and close them again.

I want to give a special thanks to the many kind souls who helped us debug this problem (you know who you are – thank you). In particular, Benjamin Black endured no end of my griping about “how you maybe couldn’t fix it, and the temp file trick wasn’t that bad, right?”, and Artur Bergman pointed us in the direction of changing the order in which you wait for the child, which was crucial in making things work.

Adam Jacob

Adam Jacob is the CTO and co-founder of Chef.