Bootstrapping Nodes in Bulk

Bootstrapping the chef-client on many nodes at once can present a challenge. Using the traditional bootstrapping tool included with the ChefDK (knife bootstrap) to install and register the chef-client on hundreds or thousands of nodes will exhaust resources on the bastion host, including CPU, memory, and TCP connections, before your operations are complete. In this article, we’ll cover strategies to make bootstrapping nodes in bulk possible.

Bootstrap Performance Expectations

For a Chef Server with 8 cores and 32GB of memory, a good starting point is 3000 bootstrap operations over 30 minutes (100 bootstraps per minute). This is a conservative benchmark, so it’s a good place to start a test-and-tune cycle. With additional Chef Server tuning and fine-tuning of existing chef-client runs, you’ll be able to surpass this benchmark. We started seeing many failures around 12000 bootstrap operations over 30 minutes (400 bootstraps per minute). We determined these numbers by doing some simple load testing on Chef Server using chef-load. You can test your own Chef Server using this project too!

Distribute Artifacts

Before installing and running the chef-client, we can stage artifacts required for the bootstrap on all the target machines ahead of time.

Here are the artifacts that need to be present on each machine:

  • chef-client executable for your target platform
  • client.rb
  • first-boot.json
  • Your organization validator key

The easiest way to do this is to generate the small artifacts on each target machine and transmit the remaining artifacts to it. You will want to write your own script so that it accounts for any environment differences in your organization.

Chef Community member James Massardo has an awesome PowerShell script that you can use as a template for your own. You’ll want to review and modify the script to fit the needs of your organization and environment. (Thank you James!)

https://github.com/jmassardo/Install-Chef-using-ConfigMgr/blob/master/Create-BootstrapFiles.ps1

There are also similar PowerShell and Bash example scripts available in the Chef Docs.

These artifacts should be generated using a script on the target machine, typically in /etc/chef/ on Linux and C:\Chef on Windows:

  • The client.rb for the target machine.
  • The first-boot.json file.

Let’s break down these files a bit to gain a better understanding of them.

client.rb

This is the file that is used to configure the chef-client. You can find all of the settings here. https://docs.chef.io/config_rb_client.html

In particular, you’ll want to pay attention to these settings:

  • log_level
  • log_location
  • chef_server_url
  • validation_client_name
  • validation_key
  • node_name
  • ssl_verify_mode
  • environment

The node_name should be generated from $env:computername on Windows, or from the machine’s hostname on Linux.
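As a rough illustration of generating this file from a script, here is a minimal Ruby sketch that renders a client.rb from the settings listed above. It assumes a Ruby interpreter is available on the machine. The server URL, organization name, validator name, and file paths are all placeholders invented for the example, and render_client_rb is a hypothetical helper, not part of Chef.

```ruby
require 'socket'

# Render the text of a client.rb file. Every value passed in below is a
# placeholder for illustration; substitute your own URL, names, and paths.
def render_client_rb(node_name:, chef_server_url:, validation_client_name:,
                     validation_key:, environment:, log_location:)
  <<~CONFIG
    log_level              :info
    log_location           '#{log_location}'
    chef_server_url        '#{chef_server_url}'
    validation_client_name '#{validation_client_name}'
    validation_key         '#{validation_key}'
    node_name              '#{node_name}'
    ssl_verify_mode        :verify_peer
    environment            '#{environment}'
  CONFIG
end

config = render_client_rb(
  node_name: Socket.gethostname.downcase, # derive the name from the hostname
  chef_server_url: 'https://chef.example.com/organizations/example_org',
  validation_client_name: 'example_org-validator',
  validation_key: '/etc/chef/example_org-validator.pem',
  environment: 'production',
  log_location: '/var/log/chef/client.log'
)

# Print instead of writing, so the result can be reviewed first.
# To write it: File.write('/etc/chef/client.rb', config)
puts config
```

Lowercasing the hostname is a design choice made here so that node names stay consistent across machines. Drop it if your naming convention is case-sensitive.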

first-boot.json

This file is used with the -j option to bootstrap the node. It looks like this:

{
  "run_list": ["recipe[foo]"],
  "container_service": {
    "chef-init-test": {
      "command": "/opt/chef/bin/chef-init-test"
    }
  }
}

You can customize this file to suit the needs of your organization. Here’s another example, for a minimal bootstrap:

{
  "run_list": ["recipe[company_base::default]"]
}
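If your script is already generating files, the minimal first-boot.json can be produced the same way. This is a small sketch using Ruby's standard json library. The helper name first_boot_json is made up for the example, and the run list is the one shown above.

```ruby
require 'json'

# Build the first-boot.json content for a given run list.
def first_boot_json(run_list)
  JSON.pretty_generate('run_list' => run_list)
end

# Print instead of writing, so the result can be reviewed first.
# To write it: File.write('/etc/chef/first-boot.json', first_boot_json([...]))
puts first_boot_json(['recipe[company_base::default]'])
```

Generating the file from a data structure rather than by string concatenation keeps the JSON valid even when the run list grows.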

chef-client

The chef-client installer is around 34.5MB in size, so unlike the files above, it can’t simply be generated by a script. Instead, you should upload the version of the chef-client you plan to deploy to an internal artifact repository, such as Nexus or Artifactory. Then your script can make a simple HTTP request to download the file to the machine. Here’s some example PowerShell:

# $upgrade_version must be set earlier in your script, and C:\chef\cache must already exist.
$source = "https://artifactory.example.com/windows/2008r2/x86_64/chef-client-$upgrade_version-1.msi"
$destination = "C:\chef\cache\chef-client-$upgrade_version-1.msi"
Invoke-WebRequest $source -OutFile $destination

Organization validator key

The organization validator key is a bit tricky. You can store it in an internal artifact repository, but since this is a private key, you may not want to do that for security reasons. Alternatively, you can transmit the key over PowerShell remoting or scp, in a loop, to all the machines you plan to bootstrap.
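The scp loop mentioned above could look something like the following Ruby sketch. The host names, user name, and key paths are hypothetical, and scp_commands is a helper written for this example. It only prints the commands (a dry run), so you can check them before copying a private key anywhere.

```ruby
# Build one scp command per host. Commands are returned as argument arrays
# so that nothing is interpolated into a shell string.
def scp_commands(hosts, key_path:, remote_path:, user:)
  hosts.map { |host| ['scp', key_path, "#{user}@#{host}:#{remote_path}"] }
end

hosts = %w[node01.example.com node02.example.com] # hypothetical host names

commands = scp_commands(hosts,
                        key_path: 'example_org-validator.pem',
                        remote_path: '/etc/chef/example_org-validator.pem',
                        user: 'deploy')

commands.each do |argv|
  puts argv.join(' ')
  # Dry run only. To actually copy, replace the puts with:
  #   system(*argv) or warn "copy failed: #{argv.last}"
end
```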

Configure chef-client wrapper cookbook

The default run-list for your nodes should include your own wrapper cookbook for the chef-client cookbook. This cookbook provides several methods to optimally configure the chef-client to make bulk bootstrapping simple.

If you are using a validator, include the delete_validation recipe (chef-client::delete_validation) to ensure the validator key is cleaned up after the chef-client registers with the Chef Server.

Windows

Set a healthy frequency of chef-client runs (CCRs) using these settings. You can tune these settings down or up depending on the performance of your Chef Server.

node.default['chef_client']['task']['frequency'] = 'minute'
node.default['chef_client']['task']['frequency_modifier'] = 60 # 1 hour
node.default['chef_client']['splay'] = 1800 # 30 minutes

include_recipe 'chef-client::task'
include_recipe 'chef-client::delete_validation'

Use Windows Scheduled Tasks to execute the chef-client. This is the most robust way to run the chef-client on Windows. You can find the recipe that creates the scheduled task here.

Linux

Set a healthy frequency of chef-client runs using these settings:

node.default['chef_client']['interval'] = 3600
node.default['chef_client']['splay'] = 1800

include_recipe 'chef-client::service'
include_recipe 'chef-client::delete_validation'

For additional settings, refer to the chef-client cookbook on GitHub.

Bootstrapping Strategies

  1. Use a simple run_list for the first run of the chef-client. This ensures that cookbook dependency resolution and cookbook downloads do not take up system resources. This also limits initial failure conditions, if you happen to run more complex cookbooks. Your run list should only include cookbooks essential for a smooth bootstrap.

  2. Run the chef-client with a splay to ensure the clients do not all register with the chef-server at the same time. You can also schedule different times for executing the first run of the chef-client.

  3. Plan the number of bootstrap operations per second. In general, if using an 8 core, 32 GB RAM Chef Server, you should aim for bootstrapping around 3000 nodes in a 30 minute window for your first run. That works out to about 1.67 bootstrap operations per second, or 100 bootstrap operations per minute. These numbers represent a baseline: you will be able to get even faster bootstrap operations with larger server resources, server tuning, validatorless bootstrapping, and chef-client optimizations. Getting faster numbers will require a test, monitor, and tune loop.

  4. Write down a matrix of every operating system you are bootstrapping, and test each of them. Some older versions of Windows may not have the right version of PowerShell to support your bootstrapping method. More esoteric systems (AIX, Solaris) could require updated packages to support the chef-client. In any case, you’ll want to test each operating system for edge cases before kicking off a run.
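To make the pacing in strategies 2 and 3 concrete, here is a small Ruby sketch. It computes the bootstrap rate for a node count and window, and assigns each node a random start offset inside the window, which spreads registrations in the same spirit as a splay. The helper names and node names are invented for the example. This is not how the chef-client implements splay internally.

```ruby
# Bootstraps per minute and per second for a node count spread over a window.
def bootstrap_rate(nodes, window_minutes)
  per_minute = nodes.to_f / window_minutes
  { per_minute: per_minute, per_second: (per_minute / 60).round(2) }
end

# Assign each node a random start offset (in whole seconds) inside the
# window, so that first runs do not all hit the Chef Server at once.
def start_offsets(node_names, window_seconds, rng: Random.new)
  node_names.to_h { |name| [name, rng.rand(window_seconds)] }
end

rate = bootstrap_rate(3000, 30)
puts "#{rate[:per_minute]} per minute, about #{rate[:per_second]} per second"

# Hypothetical node names; offsets fall between 0 and 1799 seconds.
start_offsets(%w[web01 web02 db01], 1800).each do |name, seconds|
  puts "#{name}: start after #{seconds}s"
end
```

The offsets can feed whatever scheduler you use for the one-time bootstrap task. Passing a seeded Random (for example Random.new(42)) makes the schedule reproducible between test runs.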

Build Confidence

As you increase the number of nodes, build confidence by starting small. For instance, if your target is to hit 3000 nodes over 30 minutes, then start with 300 nodes over 3 minutes. This ensures that you can catch error conditions (firewall issues, network failures, DNS failures, etc.) that are beyond the local scope of the nodes you’re bootstrapping.
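One way to plan such a ramp is to scale the target run down into trial stages that keep the same bootstraps-per-minute rate. The sketch below does that in Ruby. The stage fractions are arbitrary example values and ramp_stages is a hypothetical helper. Pick stages that fit your own risk tolerance.

```ruby
# Scale a target run down to smaller trial stages that keep the same
# bootstraps-per-minute rate. The default fractions are example values only.
def ramp_stages(target_nodes, target_minutes, fractions: [0.1, 0.25, 0.5, 1.0])
  fractions.map do |fraction|
    { nodes: (target_nodes * fraction).round,
      minutes: (target_minutes * fraction).round(1) }
  end
end

ramp_stages(3000, 30).each do |stage|
  puts "#{stage[:nodes]} nodes over #{stage[:minutes]} minutes"
end
```

Because every stage runs at the same rate as the final target, a failure in an early stage tells you about the environment (firewalls, DNS, and so on) rather than about server load.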

Bootstrap In Bulk

Now you’re ready to execute the bootstrap operation on your nodes. Let’s kick it off!

  1. Use the distribution method that your company already prefers to use. Use SCCM, Red Hat Satellite, or a custom script (PowerShell remoting or scp) to schedule a one-time scheduled task on the target machine to initiate the bootstrap (i.e., run chef-client -j first-boot.json).

  2. Wait for the bootstrap window. All nodes should execute their scheduled task. If you’re using Chef Automate and have configured the nodes to report their runs, you will see them show up there too.

  3. Manually resolve any bootstrap failures (network offline, etc.) by comparing the expected list of nodes against the nodes that appear in Chef Automate.
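The comparison in step 3 is a set difference, and can be sketched in Ruby as follows. Both lists here are hypothetical sample data. In practice the expected list would come from your inventory and the reported list from Chef Automate or from knife node list. missing_nodes is a helper invented for the example.

```ruby
require 'set'

# Return the nodes that were expected to bootstrap but have not reported in.
# Names are compared case-insensitively, since hostnames often differ in case.
def missing_nodes(expected, reported)
  seen = reported.map(&:downcase).to_set
  expected.reject { |name| seen.include?(name.downcase) }
end

expected = %w[web01 web02 db01] # hypothetical inventory list
reported = %w[WEB01 db01]       # hypothetical list of nodes that checked in

puts missing_nodes(expected, reported)
```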

Other things to consider

If you run into failures with server-side key generation, you can tune the server to increase the number of key generation workers. In our load testing, our 8 core, 32GB Chef Server was able to hit maximum load, so tuning this setting was unnecessary. But if you’re bootstrapping at a higher rate (more than ~10000 nodes in 30 minutes), you may need this setting.

In /etc/opscode/chef-server.rb:

  • opscode_erchef['keygen_cache_workers'] – normally set to :auto; override if insufficient. :auto resolves to half the number of logical processors available on the Chef Server. This setting determines the number of workers that generate keys for newly registering clients. Do not set this number higher than the number of logical processors on your Chef Server.
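As a sketch of applying that guidance, the Ruby snippet below clamps a requested worker count to the number of logical processors and prints the line you would place in /etc/opscode/chef-server.rb. The requested value of 6 and the 8 processor count are example inputs, and keygen_workers is a hypothetical helper. When run on the Chef Server itself, omit logical_processors to use the detected count.

```ruby
require 'etc'

# Clamp a requested worker count so that it is at least 1 and never higher
# than the number of logical processors on the Chef Server.
def keygen_workers(requested, logical_processors: Etc.nprocessors)
  [[requested, 1].max, logical_processors].min
end

workers = keygen_workers(6, logical_processors: 8) # example inputs
puts "opscode_erchef['keygen_cache_workers'] = #{workers}"
```

After editing chef-server.rb, the change takes effect once you run chef-server-ctl reconfigure.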
Author David Echols