Who's checking in, the mcollective trick.

By thomas, 17 September, 2014

This keeps coming up, so I thought I'd share one trick we've used to find stale nodes. Stale nodes are nodes that fail to update for reasons that never show up in your reporting mechanism. A common cause is an expired or revoked certificate: the agent never gets far enough to report a failure.

In these cases, provided mcollective is running and configured on the node, you may still see the node in mcollective and think everything is fine. If your implementation is small enough you can probably track down these hosts one by one, but this is how we do it with a few thousand nodes. I'm assuming you configure mcollective from puppet (this won't work if you don't, because the trick relies on a puppet run delivering a configuration change).

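Go into your activemq configuration and add authorizationEntry elements for a new collective, one for its queues and one for its topics. Call it whatever you like; I'm using stalecollective here.

<authorizationEntry queue="stalecollective.>" write="mcollective" read="mcollective" admin="mcollective" />
<authorizationEntry topic="stalecollective.>" write="mcollective" read="mcollective" admin="mcollective" />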

Now go into the mcollective server configuration that puppet manages and edit the main_collective and collectives settings.

main_collective = stalecollective
collectives = stalecollective,mcollective
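
To spot-check whether a given node has picked up the change, you can look at its server configuration directly. This is a minimal sketch: has_collective is a hypothetical helper name, and the server.cfg path in the usage comment is an assumption that varies by install.

```shell
# has_collective CFG NAME
# Succeeds if NAME is listed in the "collectives" setting of the file CFG.
# (Hypothetical helper, not part of mcollective.)
has_collective() {
    # Pull out the value of the collectives line, split it on commas,
    # trim surrounding whitespace, and look for an exact match.
    sed -n 's/^[[:space:]]*collectives[[:space:]]*=//p' "$1" |
        tr ',' '\n' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        grep -Fxq -- "$2"
}

# Usage (the path is an assumption, adjust for your install):
#   has_collective /etc/mcollective/server.cfg stalecollective && echo "updated"
```

A node where this check fails well after the waiting period is a candidate stale node.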

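Sit back and wait. I use the default check-in interval of 30 minutes, so waiting 60 minutes or so works well. Nodes that complete a puppet run pick up the new settings and join stalecollective; stale nodes never do. Now run mco against the new collective (edit your client.cfg or ~/.mcollective so the client knows about it):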

mco find -T stalecollective -v

You should see only your active hosts now. Possibly more interesting, run mco against the original collective. Updated nodes belong to both collectives, so any host that answers here but not on stalecollective is stale.

mco find -T mcollective -v
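
With a few thousand nodes you won't want to compare the two lists by eye. Here is a minimal sketch of the comparison, assuming each mco find run (without -v) has been saved to a file with one node identity per line. stale_nodes and the file names are hypothetical, not something mcollective provides.

```shell
# stale_nodes OLD_LIST NEW_LIST
# Prints hosts that appear in OLD_LIST but not in NEW_LIST, one per line.
# (Hypothetical helper; runs in a subshell so its variables do not leak.)
stale_nodes() (
    old_sorted=$(mktemp) || exit 1
    new_sorted=$(mktemp) || exit 1
    # comm needs sorted input; -u also drops duplicate lines.
    sort -u "$1" > "$old_sorted"
    sort -u "$2" > "$new_sorted"
    # -23 suppresses lines unique to the second file and lines common to
    # both, leaving only lines unique to the first file.
    comm -23 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
)

# Usage (file names are placeholders):
#   mco find -T mcollective > all-nodes.txt
#   mco find -T stalecollective > active-nodes.txt
#   stale_nodes all-nodes.txt active-nodes.txt
```

Whatever this prints is the list of hosts still answering on the original collective only.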

If you have hosts that check in less frequently you might get a few false positives, but this is still a good starting point for finding the nodes that aren't updating their configurations.
