So.. recently we've had a weird issue at $work with puppet: the puppet runs on quite a lot of servers have bunched up and now all happen at the same time. Like this video, but instead of metronomes you have ~1800 servers that were once staggered all deciding to request their catalogs from the puppetmasters at once.
This can be seen in the network graph for one of the puppetmasters.
Every 30 minutes you have around 900 servers (half the load) requesting their catalogs. That many servers requesting at once made for very unhappy masters, and the clients were getting 503s because of the high load. So we had servers failing runs (which is never awesome).
Despite puppet being used by lots of large enterprises, I couldn't find many writeups about dealing with puppet agents bunching together and how to restagger them. The majority of posts seem to be about scaling the masters.
However I did stumble across this handy post from 2012 about staggering the puppet runs with cron - http://www.krzywanski.net/archives/968
I decided to give it a whirl and, after updating the script to work with the new CA command, I have to say it seems to be working pretty fabulously, though the fix has only been in place for a few hours so far.
You can see the difference in network usage after all the agents were changed to cron.
And in the 48 hour graph you can see that the load is much more balanced now.
After adding the scripts from the previous link to the masters (with puppet of course!), I created the following mini class to apply the changes to the agents:
After testing, this was added to the top level class for all our nodes, and it rolled out without any major issues. The only issue I encountered was facter returning nil for $::fqdn when a node had a hostname but no domain set. This was fixed by editing the fqdn.rb on the node to return the hostname instead of nil. Rather hacky.. but it works without needing to change the hostname of a server in production! (This issue appears to be fixed in later versions of facter.)
Look at random cron timings with fqdn_rand() at some point instead of the node list script, as the number of nodes managed by puppet increases steadily each week and the list is currently regenerated every time a node is added.
Even though the load has been balanced, the masters do still sit with a small amount of load, so adding another master to the haproxy pool would help a lot.
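For the fqdn_rand() idea, here is a minimal sketch of what that could look like. This is not the class used above, and the class name, cron entry name and the path to the puppet binary are all assumptions, so adjust them for your own setup. fqdn_rand(30) gives each node a number from 0 to 29 derived from its fqdn, so a node keeps the same slot on every run and no node list has to be regenerated on the master.

```puppet
# Sketch only: class name, cron entry name and binary path are assumptions.
class puppet_cron_stagger {
  # fqdn_rand(30) returns a pseudo-random number from 0 to 29 that is
  # seeded by the node's fqdn, so it stays stable between runs.
  $first_run  = fqdn_rand(30)
  $second_run = $first_run + 30

  # Stop the agent daemon so that only cron triggers puppet runs.
  service { 'puppet':
    ensure => stopped,
    enable => false,
  }

  # Two runs an hour, 30 minutes apart, at this node's own offset.
  cron { 'puppet-agent-run':
    command => '/usr/bin/puppet agent --onetime --no-daemonize',
    user    => 'root',
    minute  => [$first_run, $second_run],
  }
}
```

With ~1800 nodes spread over 30 one-minute slots that averages out at about 60 nodes per minute, though the spread is random rather than perfectly even. Since the seed comes from the fqdn, nodes that report an empty fqdn would presumably all land on the same minute, so the nil $::fqdn issue would still need sorting out first.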