At $work we use glorious puppet for our configuration management needs. The first version of our puppetmaster was a Xen virtual machine with 2 cores, 8GB RAM and 60GB of disk space. This was fine for a while, but it soon started struggling under the load as the number of servers managed increased, so the puppetmaster was upgraded to a Dell R320 with 12 cores, 32GB RAM and two 240GB SSDs in RAID1.
We also started using puppetdb for fact collection (and foreman for reporting, but that is on another server). This setup worked fine for just over a year (system_birth => Thu Jan 23 13:15:23 2014), but again it started struggling under the load of managed hosts (well over 2000 currently and increasing at a steady rate). The load would spike and we would get batches of failed runs due to timeouts.
A plan was made to investigate and set up a load-balanced multimaster environment for better load management. Two Dell R630s with 32 cores, 128GB RAM and 2x480GB SSDs in RAID1 were provisioned. Plenty of performance for our puppets. The plan was to set up a dedicated haproxy box as the load balancer and to use puppet to update the server setting on each node to point at it as the master.
However, that's not what happened. Instead we had what I will affectionately call "puppet failure friday". I'm not entirely sure what happened, since I was asleep at the time, but it looks like the server started swapping, apache hit max clients, and then it couldn't keep up with the resulting load, which climbed rapidly and left puppet agents failing runs due to timeouts, or runs stagnating. This is shown in the below diagram.
This meant the multimaster plan was expedited somewhat, as I couldn't get the current master to handle the load at all. Having your puppet master sit at a load average of 200 isn't particularly great for having it do anything useful!
As I couldn't use puppet to point each agent at a new server, I used the existing puppetmaster as the load balancer. We run puppet under apache, so I changed the vhost on the existing master to listen on 8150 instead of 8140; haproxy now listens on 8140. The new masters were set up as per the puppetlabs documentation (which is generally awesome, as documentation goes), with dns_alt_names configured.
My haproxy config is:
The newer servers have a higher weight than the existing master, as they are way more awesome in terms of specs. The existing master also runs puppetdb.
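To get a feel for what the weights mean: assuming haproxy is balancing round-robin, each server gets a share of requests proportional to its weight. Here is a minimal Python sketch of that arithmetic. The server names and weights are invented for illustration and are not the ones from my config.

```python
def traffic_share(weights):
    """Return each server's expected fraction of requests,
    assuming requests are distributed in proportion to weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one server needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical names and weights, for illustration only.
example_weights = {"old-master": 10, "new-master-1": 45, "new-master-2": 45}

if __name__ == "__main__":
    for name, share in traffic_share(example_weights).items():
        print(f"{name}: {share:.0%} of requests")
```

With numbers like these the old master would still take a slice of agent runs, but most of the catalog compiles would land on the bigger boxes, leaving it headroom for puppetdb.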
The haproxy stats look like:
The existing master is also the ca master, so the other masters are set to forward any certificate requests to it (on port 8150).
The new masters have also been configured to talk to puppetdb and send reports to theforeman. As we use svn for our manifests and modules, there is currently a cron job to rsync changes to the two other masters when the existing master is updated. They will be set to do an svn up every time their puppet agent runs (like the existing master) and added to our chatops ability to run svn up (modules|manifests) on the master(s) from our internal chatserver.
The setup is handling the current load very well (as one would hope!) and manages fine if you remove a server from the haproxy pool. The only issue is that if you restart httpd on any of the servers (or haproxy itself), you get a handful of servers failing runs due to timeouts in the ~10 seconds it takes to restart!
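That "handful" is about what you'd expect. A rough back-of-the-envelope sketch in Python, assuming agents run on puppet's default 30 minute interval and that their check-ins are spread evenly across it (both are assumptions, and 2000 hosts is a lower bound):

```python
def runs_in_window(hosts, run_interval_s, window_s):
    """Expected number of agent runs that start during an outage window,
    assuming runs are spread evenly across the run interval."""
    return hosts * window_s / run_interval_s


# 2000 hosts (lower bound), an assumed 1800 s run interval,
# and the ~10 s restart window.
expected = runs_in_window(hosts=2000, run_interval_s=1800, window_s=10)
print(f"about {expected:.0f} runs start during the restart")
```

This only counts runs that begin during the restart. Runs already in flight when the restart starts can fail too, so treat it as a floor rather than an exact figure.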
Future plans: Add another server to the pool to take over the master duties of the existing master (leaving it to just do puppetdb/ca). Possibly move haproxy to its own server, though that would be another adventure.