mcollective and activemq: the 800 node limit

I've been running into the 800 node limit on mcollective and splitting up my nodes into subcollectives. I had a spot where I couldn't split up the nodes, so I started looking at why we were hitting this 800 node wall.

I'm using activemq with the ssl plugin. After turning on all the debugging I could find in activemq, it turns out it's just a simple resource limit problem.

With activemq running, I waited for my nodes to connect and watched the number of threads on the active java process. (This is after increasing the memory limits for activemq as described on the puppetlabs website.)
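
For reference, the memory tuning there amounts to raising the broker's memoryUsage limit in the systemUsage section of activemq.xml (plus the JVM heap in the service's wrapper/init config). Roughly, with 512 mb as an example value only:

<systemUsage>
    <systemUsage>
        <memoryUsage>
            <memoryUsage limit="512 mb"/>
        </memoryUsage>
    </systemUsage>
</systemUsage>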

Getting the number of threads, two different ways:

$ pgrep java |xargs ps uH |wc -l
1023
$ pgrep java |xargs -I % ls -l /proc/%/task |wc -l
1023

Either way we are seeing around 1024 processes (threads), which looks suspiciously like a limit. I increased the limits in /etc/security/limits.d/activemq.conf:

activemq soft nofile 16384
activemq hard nofile 16384
activemq soft nproc 4096
activemq hard nproc 4096

Not really sure if the nofile limit is required, but raising nproc seems to fix my issue.
After restarting activemq:

$ pgrep java |xargs ps uH |wc -l
1530
$ pgrep java |xargs -I % ls -l /proc/%/task |wc -l
1530
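
The raised limits can also be confirmed directly from /proc for the running process (this assumes a single java process on the host):

$ grep -E 'Max processes|Max open files' /proc/$(pgrep java)/limits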

The number of nodes returned by mco find goes from a random result in the 800-1000 range to the 1400 or so that I was expecting.
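
For what it's worth, that count is nothing fancier than piping the output into wc, assuming mco find's default one-identity-per-line output:

$ mco find | wc -l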

I'm going to have to update my section on mcollective in my book.


Comments

Thanks, I have seen the same behaviour in PE 3.7.1 and used the same lines as above, except for the PE activemq user, which makes it look like the following:

pe-activemq - nofile 16384
pe-activemq - nproc 4096

Following on from this, once past the limits issue you might see the following:

ARP table full

The logs will contain something similar to:


Aug 10 22:42:36 s1 kernel: Neighbour table overflow.

This indicates the ARP (neighbour) table is full. The kernel's neighbour table thresholds can be raised to overcome this; their defaults are listed below, with a sysctl sketch after them.


net.ipv6.neigh.default.gc_thresh3 defaults to 4096
net.ipv6.neigh.default.gc_thresh2 defaults to 2048
net.ipv6.neigh.default.gc_thresh1 defaults to 1025
net.ipv4.neigh.default.gc_thresh3 defaults to 4096
net.ipv4.neigh.default.gc_thresh2 defaults to 2048
net.ipv4.neigh.default.gc_thresh1 defaults to 1025
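
A minimal sketch of raising those thresholds with sysctl; the numbers are examples only and should be sized to how many hosts sit on the broker's network:

# example values only
sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
sysctl -w net.ipv4.neigh.default.gc_thresh3=16384
# repeat for the net.ipv6.* equivalents if IPv6 is in use,
# and persist the values in /etc/sysctl.conf so they survive a reboot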

ActiveMQ & JVM OutOfMemory

This one is very clear in the ActiveMQ logs: the JVM is out of memory.

The values for this can be modified in activemq.xml (mainly memoryUsage) and /etc/sysconfig/pe-activemq (wrapper.java.maxmemory), all of which can be done via the `puppet_enterprise` module.
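
For example, the heap side of it is a single property in /etc/sysconfig/pe-activemq (2048 here is only an example value, in MB):

wrapper.java.maxmemory=2048

The memoryUsage limit itself is set in the systemUsage block of activemq.xml, as in the sketch earlier in the post.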

Limits

I know your blog covered the limits, but the logs will look like this:


2014-08-10 13:34:13,921 | ERROR | Could not accept connection : java.net.SocketException: Too many open files | org.apache.activemq.broker.TransportConnector | ActiveMQ Transport Server: stomp+ssl://0.0.0.0:61613

and `lsof` can be used to show all the connections:


lsof | grep pe-activemq
java 860 pe-activemq 1174u IPv6 19825 0t0 TCP ip-172-31-11-221.us-west-2.compute.internal:61613->ip-172-31-14-66.us-west-2.compute.internal:37163 (ESTABLISHED)
java 860 pe-activemq 1175u IPv6 19827 0t0 TCP ip-172-31-11-221.us-west-2.compute.internal:61613->ip-172-31-2-116.us-west-2.compute.internal:47022 (ESTABLISHED)
java 860 pe-activemq 1176u IPv6 19831 0t0 TCP ip-172-31-11-221.us-west-2.compute.internal:61613->ip-172-31-14-78.us-west-2.compute.internal:55815 (ESTABLISHED)
java 860 pe-activemq 1177u IPv6 19833 0t0 TCP ip-172-31-11-221.us-west-2.compute.internal:61613->ip-172-31-10-134.us-west-2.compute.internal:53859 (ESTABLISHED)
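
To just get a count of the established connections rather than the full listing:

lsof | grep pe-activemq | grep -c ESTABLISHED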

Hope this helps.