Migrating your Chef Server with knife-ec-backup and knife-tidy

Overview

Now is a great time to start thinking about deploying a new Chef Server. New cloud deployment options such as AWS OpsWorks and Marketplace images for Azure & AWS, along with high-scale options including Chef Backend for On-Prem HA and the AWS Native Chef Server, make this a good moment to modernize the backbone of your automation and take advantage of the new visualization and reporting capabilities of Chef Automate.

In this article we’ll guide you through the easiest way to migrate your data over to new Chef infrastructure with the least impact to your users.

Know your Tools

knife-ec-backup

Knife-ec-backup creates a full object-based export of your Chef Server (e.g., organizations, users, cookbooks, nodes, clients & keys, roles, acls, etc.) and can restore them to another Chef Server. It was modelled after knife-backup in the sense that it downloads all objects as JSON (except cookbooks, which turn back into cookbook files) but it has been extended to understand many Enterprise Chef (and Chef Server 12) concepts including organizations, users and ACLs.

Unlike file-based backups (chef-server-ctl backup), which are optimized for speed, knife-ec-backup was designed to maximize compatibility and to let admins clean up errors and bad data after the export. The downside is that knife-ec-backup can be quite a bit slower than other backup strategies. Knife-ec-backup is a Ruby gem that has undergone significant improvements over the past 12 months, so make sure you install the latest gem before proceeding.

knife-tidy

Knife-tidy is an essential sidekick tool for Chef Server migrations. Think of it as the Robin to knife-ec-backup’s Batman.

Knife-tidy has several modes of operation:

  • knife tidy backup clean can validate and eliminate errors within a knife-ec-backup data set. Over time Chef Server has become better about validating data, but doesn’t know how to clean up existing objects that were stored within it. This function fixes those objects for you so that they will cleanly import into the latest Chef server release.
  • knife tidy server report will check your server for unused nodes and cookbook versions, which are normally the largest objects in your Chef Server data. It identifies unused cookbook versions by evaluating the run_lists of all the nodes and environment constraints, therefore providing a high degree of safety in its recommendations.
  • knife tidy server clean is like server report but will take action for you, removing the unused nodes and cookbooks.  

Editor’s note: Many improvements have gone into our products to address issues encountered when migrating data from older Chef Servers with less stringent data validation. We encourage you to use the latest versions of our packages wherever possible.

Before you begin your Migration

Installing Prerequisites

  1. Install the required development packages for your platform:
    • Ubuntu 16.04
      $ sudo apt-get -y install gcc postgresql libpq-dev
    • RHEL/CentOS/Oracle 6/7
      $ sudo yum -y install gcc postgresql-devel
    • RHEL/CentOS/Oracle 6 w/ Enterprise Chef 11
      $ sudo yum -y install https://download.postgresql.org/pub/repos/yum/9.5/redhat/rhel-6-x86_64/pgdg-redhat95-9.5-2.noarch.rpm
      $ sudo yum -y install gcc postgresql-devel postgresql95-devel
  2. Install a separate ruby environment from the embedded ruby packaged with the Chef Server. For simplicity, we are going to use the ChefDK.
    $ curl -L https://chef.io/chef/install.sh | sudo bash -s -- -P chefdk
  3. Install the latest knife-ec-backup and knife-tidy gems into your ruby environment (we are using the ruby embedded in the ChefDK we installed) on your existing Chef server:
    $ sudo /opt/chefdk/embedded/bin/gem install knife-ec-backup -- --with-pg-config=/opt/opscode/embedded/postgresql/<PG_VERSION>/bin/pg_config
    $ sudo /opt/chefdk/embedded/bin/gem install knife-tidy
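Before moving on, it can help to confirm that both gems actually landed in the ChefDK ruby rather than the Chef Server's embedded one. The following is a minimal sketch, not an official tool: check_migration_gems is a hypothetical helper name, and GEM_BIN is assumed to be the ChefDK path used in the steps above.

```shell
# Hypothetical helper: verify both plugins are installed in the ChefDK ruby.
# ASSUMPTION: GEM_BIN defaults to the ChefDK path used in the install steps.
GEM_BIN=${GEM_BIN:-/opt/chefdk/embedded/bin/gem}

check_migration_gems() {
  local missing=0 name
  for name in knife-ec-backup knife-tidy; do
    # "gem list -i PATTERN" exits non-zero when no installed gem matches.
    if "$GEM_BIN" list -i "^${name}\$" >/dev/null 2>&1; then
      echo "ok: $name"
    else
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_migration_gems on the existing Chef server prints one line per gem and returns non-zero if either is missing.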

Validate your new Chef Server

You’ve launched your new Chef Server (or cluster) and it seems to be working. But there are steps you should take before putting that new Chef Server into production:

  1. Have a repeatable Chef Server build & upgrade process: You may find yourself building and rebuilding your Chef Server several times. Great automation is the key to making that successful by reducing time and frustration. This isn’t simply a one-time activity, but will pay dividends during upgrades, patching cycles, DR exercises, and actual disasters. If you’re not already using a Chef server with a fully automated deployment process, take the time to investigate using Chef to build your Chef Server.
  2. Make sure your Chef Server is really working: Chef Server ships with our full functional testing suite, called Opscode Pedant, which exercises every single API endpoint on the Chef Server. This testing suite is useful for ensuring that all services are working as expected. You can run it on any Chef Server by executing the chef-server-ctl test command.
  3. Make sure your Chef Server can handle the load: In order to guarantee that your Chef Server will be able to handle your current and future load, we recommend running our load-testing tool against it, called chef-load.
  4. Perform Failure Testing: In cluster scenarios, make sure you can shut down or terminate any server in the cluster without impacting services. Practice failure recovery scenarios with your team. If you don’t figure this stuff out during the working day, imagine how hard it will be at 4am. :)
  5. Implement monitoring and backups: No Chef infrastructure is complete without automated backup and monitoring procedures in place before launch. For backups, take the following 3 tiers as an example:  
    1. Hourly backups using snapshots  
    2. Daily backups of the filesystem
    3. Weekly full export using knife-ec-backup

And remember that they’re not backups until you’ve validated you can successfully restore from them!

For monitoring your Chef Server, there is no better resource than this ChefConf 2016 talk titled Monitoring and Tuning your Chef Server.

Preemptively speed up your migration by pruning your existing Chef Server

The more unused objects (e.g., nodes, clients, cookbooks & organizations) you can identify and remove beforehand, the faster your migration will be and the shorter the maintenance window you will require.

As mentioned above, knife tidy server clean can be used to clean up your existing Chef server before migration. It is strongly recommended that you:

  1. Take backups before making any changes to your Chef server (or work off of a clone or restored backup first)
  2. Try the clean operation in --dry-run mode to see what actions it will take
  3. First work off of a “canary” organization if you have one
  4. Work closely with your Chef Server users so that they know what is happening and how they can quickly recover if anything goes wrong (share the full backup so that anybody can access it and selectively restore objects if needed)
  5. Plan for two passes: knife tidy server clean may need to run twice in order to effectively clean out unused cookbooks. That’s because as more stale nodes are removed, the calculated list of needed cookbook versions is likely to shrink.
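The dry-run-first, two-pass approach above can be sketched as a small wrapper. Treat this as an illustration under stated assumptions rather than a supported script: prune_pass, prune_server, PIVOTAL_RB and DRY_RUN are hypothetical names, and we assume that the clean step works from the output of the most recent knife tidy server report run.

```shell
# Sketch of the cautious clean-up sequence. With DRY_RUN=1 (the default)
# nothing is deleted. All names below are hypothetical.
KNIFE=${KNIFE:-/opt/chefdk/embedded/bin/knife}
PIVOTAL_RB=${PIVOTAL_RB:-/etc/opscode/pivotal.rb}
DRY_RUN=${DRY_RUN:-1}

prune_pass() {
  # ASSUMPTION: clean acts on fresh report output, so report runs first.
  "$KNIFE" tidy server report -c "$PIVOTAL_RB" || return
  if [ "$DRY_RUN" = 1 ]; then
    "$KNIFE" tidy server clean --dry-run -c "$PIVOTAL_RB"
  else
    "$KNIFE" tidy server clean -c "$PIVOTAL_RB"
  fi
}

prune_server() {
  # Two passes: removing stale nodes in pass 1 can shrink the list of
  # cookbook versions that pass 2 still considers in use.
  local pass
  for pass in 1 2; do
    echo "== prune pass $pass (DRY_RUN=$DRY_RUN) =="
    prune_pass || return
  done
}
```

In dry-run mode both passes report the same thing, since nothing is removed between them. The second pass only pays off once you rerun with DRY_RUN=0 after taking backups.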

It is also a good practice to maintain a list of important and needed Chef organizations, and to regularly audit and prune your organization list. Listing and removing organizations is easily accomplished with the chef-server-ctl Org Management commands.
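One simple way to run that audit is to compare the server's organization list against a hand-maintained keep list. A minimal sketch, assuming chef-server-ctl org-list prints one organization name per line; orgs_not_on_keep_list and keep_orgs.txt are hypothetical names:

```shell
# Hypothetical audit helper: print organizations that exist on the server
# but are absent from a keep list (one org name per line).
# Reads the server's organization list on stdin.
orgs_not_on_keep_list() {
  local keep_file=$1
  # -v invert, -x whole-line match, -F fixed strings, -f patterns from file.
  # "|| true" keeps the exit status clean when every org is on the list.
  grep -vxFf "$keep_file" || true
}

# Usage on the Chef Server (review the output by hand before deleting):
#   chef-server-ctl org-list | orgs_not_on_keep_list keep_orgs.txt
```

The helper only lists candidates. Removing an organization remains a deliberate manual step with the Org Management commands.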

Migration Process

Your migration will happen in two phases: the Initial Transfer phase, and the Synchronization phase (which consists of many small “catch up” syncs). Those familiar with the Unix rsync tool will find this process to be identical in concept.

Initial Transfer

The Initial Transfer phase can be pretty slow during both the backup and restore phases. It’s strongly recommended that you use a shell session manager like tmux or screen to maintain your session in the event your computer is disconnected. Taking this one step further, you might configure those tools to capture all of the session history, or use a tool like script or tee to do that for you.
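As one way to capture that history, the sketch below wraps a long-running command with tee so that everything it prints is also appended to a timestamped log file. It requires bash (for PIPESTATUS); run_logged and LOG_DIR are hypothetical names.

```shell
# Hypothetical wrapper (bash): run a long command, append everything it
# prints to a timestamped log file, and preserve its exit status.
LOG_DIR=${LOG_DIR:-$HOME/chef-migration-logs}

run_logged() {
  mkdir -p "$LOG_DIR" || return
  local log status
  # Name the log after the command's basename, e.g. 20240101T120000-knife.log
  log="$LOG_DIR/$(date +%Y%m%dT%H%M%S)-${1##*/}.log"
  echo "logging to $log" >&2
  "$@" 2>&1 | tee -a "$log"
  # PIPESTATUS[0] is the wrapped command's status, not tee's.
  status=${PIPESTATUS[0]}
  return "$status"
}

# Usage inside a tmux session (start one with: tmux new-session -s chef-migration):
#   run_logged /opt/chefdk/embedded/bin/knife ec backup my_backup_destination ...
```

If your connection drops, tmux attach -t chef-migration gets you back to the running session, and the log file keeps the full output either way.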

Initial backup and restore process:

    1. Ensure you have installed the required prerequisites.
    2. Export all of your Chef Server data:
      • $ /opt/chefdk/embedded/bin/knife ec backup my_backup_destination --with-user-sql --with-key-sql --concurrency 20 -c /etc/opscode/pivotal.rb
      • Note: The default concurrency is 10. Pay special attention to your erchef logs and back off the concurrency number if you notice 502 or 412 errors sent to clients, as you don’t want to overload the Chef Server and affect existing traffic.
    3. Run knife-tidy on the export to resolve all compatibility issues
      • $ /opt/chefdk/embedded/bin/knife tidy backup clean --backup-path my_backup_destination
    4. Import the data on to the new cluster
      • Perform step 1 to install the latest knife-ec-backup and knife-tidy gems
      • $ /opt/chefdk/embedded/bin/knife ec restore my_backup_destination --with-user-sql --with-key-sql -c /etc/opscode/pivotal.rb
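While the export runs, a quick count of 502 and 412 responses helps you decide whether to back off the concurrency. The sketch below is only an illustration: it assumes your erchef request log lines carry a "status=NNN;" field, which you should verify against your own logs, and count_overload_errors and ERROR_PATTERN are hypothetical names.

```shell
# Count recent requests that ended in 502 or 412 in an erchef request log.
# ASSUMPTION: log lines carry a "status=NNN;" field; adjust ERROR_PATTERN
# if your log format differs.
ERROR_PATTERN=${ERROR_PATTERN:-'status=(502|412);'}

count_overload_errors() {
  local log_file=$1 lines=${2:-5000}
  # Look only at the tail so the count reflects the current backup run.
  # grep -c exits non-zero on a zero count, hence "|| true".
  tail -n "$lines" "$log_file" | grep -Ec "$ERROR_PATTERN" || true
}

# Usage: count_overload_errors <path to your erchef request log> 2000
```

A count that keeps climbing between checks is the signal to stop and rerun the backup with a lower --concurrency value.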

Synchronization Phase

This phase uses the exact same steps as the Initial Transfer phase. Optionally, you can add the --purge flag to knife ec backup (but not on restore*) to delete objects in the backup folder that have been deleted on the source Chef Server.

The Synchronization phase is much shorter than a full transfer, because only the changed objects need to be transferred. Schedule periodic syncs using cron during this phase while you plan your final cut-over. The time it takes to complete one full synchronization cycle will determine the length of the maintenance window (Chef server downtime) needed for the cut-over.

*Using --purge while restoring can have unintended effects because of the way Chef Server de-duplicates cookbook files between versions of a cookbook.
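For the source-server half of a periodic sync, a cron-friendly wrapper might look like the sketch below. It reuses the backup and tidy commands from the Initial Transfer steps, adds --purge, skips the run if the previous one is still going, and prints how long the cycle took. sync_once, BACKUP_DIR and LOCK_DIR are hypothetical names, and the lock directory location is an assumption.

```shell
# Hypothetical sync wrapper for the source Chef Server, meant for cron.
KNIFE=${KNIFE:-/opt/chefdk/embedded/bin/knife}
BACKUP_DIR=${BACKUP_DIR:-my_backup_destination}
LOCK_DIR=${LOCK_DIR:-/tmp/chef-sync.lock}

sync_once() {
  # mkdir is atomic, so it doubles as a simple lock against overlapping runs.
  mkdir "$LOCK_DIR" 2>/dev/null || {
    echo "sync already running, skipping" >&2
    return 0
  }
  local start status=0
  start=$(date +%s)
  "$KNIFE" ec backup "$BACKUP_DIR" --with-user-sql --with-key-sql --purge \
      -c /etc/opscode/pivotal.rb \
    && "$KNIFE" tidy backup clean --backup-path "$BACKUP_DIR" \
    || status=$?
  rmdir "$LOCK_DIR"
  # The elapsed time is your estimate for the cut-over maintenance window.
  echo "sync finished in $(( $(date +%s) - start ))s with status $status"
  return "$status"
}
```

A matching job on the new server would run the knife ec restore command from the Initial Transfer steps, without --purge, against the same backup directory (shared or copied over).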

Plan and execute the cut-over

The cut-over phase is essentially a Synchronization phase but with two additional steps added:

  1. Before the synchronization, stop all traffic going to your existing Chef server: This is most simply accomplished with a load balancer or firewall rule that can be quickly enabled and disabled.
  2. After synchronization, change the DNS entry for your Chef Server to point at the new server or cluster: As long as SSL certificates match, Chef clients will have no problems communicating with the new server. If the SSL certificates are different and issued by an internal PKI, you can pre-seed the certs from the cert chain in /etc/chef/trusted_certs to all the clients so that they will be trusted.
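If you do need to pre-seed a certificate, the per-client step is just a file copy into the trusted_certs directory. A minimal sketch, assuming you already have the certificate as a PEM file from your PKI; seed_trusted_cert is a hypothetical helper name:

```shell
# Hypothetical helper for pre-seeding a certificate on a client.
# ASSUMPTION: the certificate is a PEM file you obtained from your PKI.
seed_trusted_cert() {
  local cert=$1 dest=${2:-/etc/chef/trusted_certs}
  # Refuse anything that does not look like a PEM certificate.
  grep -q -- '-----BEGIN CERTIFICATE-----' "$cert" || {
    echo "not a PEM certificate: $cert" >&2
    return 1
  }
  mkdir -p "$dest" && cp "$cert" "$dest/"
}

# Usage on a client: seed_trusted_cert internal-ca.crt
```

In practice you would probably push the file out with Chef itself while clients can still reach the old server, well before the cut-over.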

Pro-tips

Preparing a Ruby Environment

To migrate your Chef Server you will be installing the latest gems for knife-ec-backup and knife-tidy. If you’re using Enterprise Chef 11 or older versions of Chef Server 12, you may have gem dependency conflicts when installing into your Chef Server’s ruby environment. To avoid this, we recommend installing ChefDK on your Chef server and installing the gems into its ruby environment ( chef gem install knife-ec-backup knife-tidy ).

Parallelism

In order to accelerate the migration of large clusters, it may be possible to parallelize the backup and restore across multiple frontends. Leveraging a shared filesystem such as NFS or EFS also helps by reducing duplication. 

This script provides an example for a parallelized restore operation, spreading the load in similarly sized grouped batches across the number of frontends you wish to utilize.
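The referenced script is not reproduced here. Purely to illustrate the batching idea, the sketch below assigns each organization in a backup directory to one of N batches, largest first, always filling the lightest batch. It assumes the backup keeps one directory per organization under organizations/, and plan_restore_batches is a hypothetical name. It only prints a plan and does not run any restore.

```shell
# Hypothetical planner: assign each organization in a knife-ec-backup
# directory to one of N batches, keeping batch sizes roughly even.
# ASSUMPTION: the backup holds one directory per org under organizations/.
# Prints "<batch-number> <org-name>" lines.
plan_restore_batches() {
  local backup_dir=$1 frontends=$2
  # Largest organizations first, then always fill the lightest batch.
  du -sk "$backup_dir"/organizations/*/ | sort -rn |
    awk -F'\t' -v n="$frontends" '
      {
        sub(/\/+$/, "", $2)              # drop the trailing slash
        count = split($2, parts, "/")
        org = parts[count]
        best = 1
        for (i = 2; i <= n; i++)
          if (load[i] + 0 < load[best] + 0) best = i
        load[best] += $1 + 1             # +1 so empty orgs still spread out
        print best, org
      }'
}

# Usage: plan_restore_batches my_backup_destination 3
```

Each frontend would then restore only the organizations in its batch. We believe knife ec restore offers a per-organization option (--only-org) for this, but check knife ec restore --help for your installed version before relying on it.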

Monitoring & Adjusting

It’s important to monitor server loads and API response times during the backup and restore phases. Consider adding a dedicated frontend if a backup puts unacceptable strain on your production cluster. To speed restores, consider using more powerful servers or instance sizes temporarily during the migration process. In both backup and restore cases, tiered and clustered systems have a significant performance advantage over standalones.
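For a rough view of API response times while a backup or restore is running, you can time a lightweight request in a loop. A minimal sketch, assuming your Chef Server answers on its /_status endpoint and that chef.example.com stands in for your own hostname; probe_api is a hypothetical name:

```shell
# Hypothetical probe: print the HTTP status code and total response time
# (in seconds) for one request. CURL is overridable for testing.
CURL=${CURL:-curl}

probe_api() {
  local url=$1
  # -s silent, -o /dev/null discard the body, -w print code and time taken,
  # --max-time 30 gives up on a request that hangs.
  "$CURL" -s -o /dev/null --max-time 30 -w '%{http_code} %{time_total}\n' "$url"
}

# Usage (chef.example.com is a placeholder):
#   while sleep 10; do probe_api https://chef.example.com/_status; done
```

This is no substitute for real monitoring, but a steadily rising time_total during a backup is an early hint to lower the concurrency or add a dedicated frontend.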

Migrating from an Open Source Chef 11 Server (vs Enterprise Chef 11)

Prior to Chef Server 12, there were two separate packages for Open Source and Enterprise Chef Servers. As of Chef Server 12, these were unified into a single open source package. If you are migrating data from an Open Source Chef 11 Server, check out our notes for upgrading from Open Source Chef 11.

Merging Chef Server Data Sets or Attempting to Restore to an Existing Server

Update: As of knife-ec-backup 2.4.0, there are no longer restrictions on restoring knife-ec-backup data sets from multiple chef servers with user sql records into a single chef server.

If you’re using the SQL options ( --with-user-sql --with-key-sql ) on restore, then a couple of scenarios are not possible:

  • You cannot perform restores from two different source Chef Servers onto a single destination
  • You cannot add, modify or remove any other users on the destination Chef Server

Large numbers of database key errors are a signal that this has happened, and you may need to start the process over.

Irving Popovetsky

Irving leads the Customer Engineering team at Chef