Announcing Chef Server High Availability and Replication

Today I’m thrilled to announce two new add-ons for Chef Server 12: Chef Server High Availability and Chef Server Replication. These two features are among the most frequently requested product enhancements and allow customers to geographically distribute highly available Chef server clusters while maintaining a single view for Chef content – cookbooks, roles, environments, and data bags. These and other add-ons can be obtained from the Chef download site.

Chef Server High Availability

In Enterprise Chef 11 and prior, high-availability Chef server clusters could only be reliably built on bare metal, using a shared block device with a technology known as DRBD (distributed replicated block device). DRBD now ships with the base Chef Server 12 product, and today we are introducing a new add-on that will support additional deployment scenarios, such as those in the cloud.

At launch, Chef Server HA supports Amazon Web Services (AWS), because it brings the necessary infrastructure primitives to the table: a block device that can be detached and reattached to the backend Chef servers, as well as a floating (“elastic”) IP. We intend to expand our high-availability support to other popular clustering technologies, like Red Hat Cluster Manager, and to other clouds, whether public or private, that provide similar kinds of infrastructure primitives. The following diagram illustrates the deployment scenario of Chef Server High Availability in AWS.

Chef Server HA in AWS diagram

There is no high availability between regions. It is generally considered poor system design to attempt synchronous operations over high-latency links. For multi-region deployments, you likely want to implement Chef Server Replication instead.
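To make the two AWS primitives concrete, here is a minimal in-memory sketch of a failover step: the block device is detached from the failed backend and reattached to the standby, and the floating IP is repointed. This is an illustration only, not how the add-on is implemented. The `Backend`, `Cluster`, and `fail_over` names are hypothetical, and no real AWS calls are made.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    """Hypothetical stand-in for a backend Chef server."""
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    """Hypothetical model: one detachable block device and one floating IP."""
    volume_attached_to: Optional[str]
    floating_ip_on: Optional[str]


def fail_over(cluster: Cluster, active: Backend, standby: Backend) -> bool:
    """Move the volume and floating IP to the standby if the active node is down.

    Returns True if a failover happened, False otherwise.
    """
    if active.healthy or not standby.healthy:
        return False  # nothing to do, or no healthy node to move to
    cluster.volume_attached_to = None          # detach from the failed node
    cluster.volume_attached_to = standby.name  # reattach to the standby
    cluster.floating_ip_on = standby.name      # repoint the floating IP
    return True


if __name__ == "__main__":
    a, b = Backend("backend-a"), Backend("backend-b")
    cluster = Cluster(volume_attached_to=a.name, floating_ip_on=a.name)
    a.healthy = False
    print(fail_over(cluster, a, b), cluster)
```

Because the volume and the IP live in a single region, a sketch like this only makes sense within one region, which matches the limitation described above.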

For more information, please see the Chef Server High Availability documentation or skip directly to the section on AWS.

Chef Server Replication

If you have multiple Chef servers in multiple regions (they can be high-availability Chef servers as well), you may be faced with the problem of keeping the content in these servers consistent. That’s where our new Chef Server replication feature comes in. Using this add-on, you designate one of your Chef servers as the primary, and one or more other Chef servers as replicas. Each replica will periodically awaken and synchronize any changed content from the primary: cookbooks, roles, environments, and data bags. Replication can be configured on a per-organization basis. It is also consistent across network partitions.
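The pull model described above can be sketched as a small in-memory simulation: a replica compares its content for one organization against the primary and copies whatever differs. This is a sketch under stated assumptions, not the add-on's actual mechanism. The `Server` class, the `sync_org` function, and the integer revision numbers are all hypothetical, and deletions are not handled.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# The content types the post says are replicated.
CONTENT_TYPES = ("cookbooks", "roles", "environments", "data_bags")

# (content_type, item_name) -> revision number (hypothetical versioning scheme)
OrgContent = Dict[Tuple[str, str], int]


@dataclass
class Server:
    """Hypothetical Chef server holding content per organization."""
    orgs: Dict[str, OrgContent] = field(default_factory=dict)


def sync_org(primary: Server, replica: Server, org: str) -> int:
    """Copy changed content for one organization from primary to replica.

    Only the replicated content types are considered; anything else on the
    primary (for example node objects) is skipped. Returns the number of
    items that were copied.
    """
    source = primary.orgs.get(org, {})
    target = replica.orgs.setdefault(org, {})
    changed = 0
    for key, revision in source.items():
        content_type = key[0]
        if content_type in CONTENT_TYPES and target.get(key) != revision:
            target[key] = revision
            changed += 1
    return changed


if __name__ == "__main__":
    primary = Server(orgs={"acme": {("cookbooks", "nginx"): 2,
                                    ("roles", "web"): 1,
                                    ("nodes", "host1"): 5}})
    replica = Server()
    print(sync_org(primary, replica, "acme"))  # first pass copies content
    print(sync_org(primary, replica, "acme"))  # second pass finds no changes
```

In a real deployment each replica would run a pass like this on a timer, once per organization that has replication enabled.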

A typical deployment scenario is to place the primary in one geographic region, and replicas in other geographic regions, as this diagram illustrates.

Chef Server Replication deployment diagram.

For more information on Chef Server Replication, please see the Chef Server Replication documentation.

Conclusion

We’re introducing Chef Server High Availability and Chef Server Replication today to meet the demands of larger enterprises that want to build highly available, multi-region Chef server deployments. We’d be delighted if you’d try out these add-ons, and we welcome your feedback.

Author Julian Dunn

Julian is a product manager at Chef & started his career at the company in professional services. His first experience with Chef was at SecondMarket, a New York-based alternative markets startup. He has fifteen years of systems administration & software development experience at outfits large and small across such diverse sectors as advertising, broadcasting, Internet security and construction. When he's not helping customers, he enjoys good craft beer, indie music, and writing biographies about himself in the third person.

  • aughban

    great stuff! any plans to support object storage as a backend as well as block devices?

    • Julian Dunn

      Not immediately, as we need the relational nature of SQL databases for the operation of the Chef server.

  • Fantastic! For the future, maybe you could support the following use case: have a Chef Server cluster in each region which is the master for the nodes in that region, but a slave for the other regions. If a cluster goes down, make one of the surviving clusters primary for that region until the original master can be restored and resync’d.

    • Julian Dunn

      Yep, that is on the roadmap — though we’d need to figure out what (if anything) to do about client-side failover.

  • In a replication setup, since it’s not multi-master, the node objects are all saved to the master? Or…?

    • As of this first release, we do not synchronize the node objects or clients. It’s content only, with each replica being a “master” for its failure domain. Once we’re content all that is working, we can talk about other scenarios.

      • Thanks for clarifying, Adam.

        Okay, so partial replication. When you say replication, that means something specific to people.

        The reality should maybe be depicted in the diagram above with ‘nodes’ and ‘clients’ items hanging off all of the servers (including of course the master).