Testing your Pacemaker cluster the proper way

Do you want to test your Pacemaker installation but don't know how to do that properly? We'll help.

Every now and then, we're asked what the best method is to test the functionality of a Pacemaker cluster. The problem is that there is no general answer to this question. So the first question to ask is: what do you actually want to test?

Testing the fail-over capabilities of your cluster

Finding out whether fail-over works as expected in your cluster can be done in numerous ways. You could do it by kicking one node out of the cluster altogether, for example by pulling the power cord.

You can also run pkill -9 corosync on one of the two nodes. If you have STONITH set up, killing Corosync will show whether your STONITH configuration is working as expected, too!

And last but not least, you can always use reboot -f to forcefully reboot a machine.

You may also refer to this blog post from Martin to find out why cutting communication links between cluster nodes is not a valid test for fail-over functionality.

Testing the communication links

In normal operation, corosync-cfgtool -s should list all of your configured communication channels, and every ring should be marked as active with no faults:

Printing ring status.
Local node ID 1881843904
RING ID 0
	id	= 192.168.42.112
	status	= ring 0 active with no faults
RING ID 1
	id	= 192.168.22.112
	status	= ring 1 active with no faults

If you want to test whether corosync will properly detect ring outages, you can simply take one of the communication links down by pulling the cable. corosync-cfgtool -s will then show one ring marked as FAULTY:

RING ID 1
	id	= 192.168.22.111
	status	= Marking seqid 1567 ringid 1 interface 192.168.22.111 FAULTY

If you re-plug the cable afterwards, corosync (starting from version 1.4) should also automatically detect this and re-enable the ring.
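If you want to run the ring check above from a script rather than by eye, something like the following could work. This is only a minimal sketch: the function name faulty_rings is made up for this post, and it assumes the output format shown above (a "RING ID n" line followed by its "id" and "status" lines), which may differ between Corosync versions.

```shell
#!/bin/sh
# Hypothetical helper: reads `corosync-cfgtool -s` output on stdin and prints
# the ID of every ring whose status line contains FAULTY.
# Assumption: the output looks like the samples in this post.
faulty_rings() {
    awk '
        /^RING ID/           { ring = $3 }   # remember which ring we are in
        /status/ && /FAULTY/ { print ring }  # report it if its status is FAULTY
    '
}

# Usage on a cluster node:
#   corosync-cfgtool -s | faulty_rings
```

No output means no ring was reported as FAULTY. After re-plugging the cable, running it again should print nothing once the ring has been re-enabled.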

Testing DRBD resource-level fencing

If you use DRBD in your cluster setup and have redundant communication channels set up along with the resource-level fencing function, manually cut the DRBD communication path between the two nodes while leaving the cluster communication intact. If crm configure show afterwards contains a new location constraint that restricts the Master role of the DRBD resource to the node with up-to-date data, resource-level fencing works.
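Looking for the fencing constraint can be scripted, too. The sketch below is an assumption-laden example, not a definitive check: it assumes your fence-peer handler is DRBD's crm-fence-peer.sh and that this handler names its constraints with the prefix drbd-fence-by-handler-. The function name drbd_fence_constraints is made up for this post; adjust the pattern if your constraint IDs look different.

```shell
#!/bin/sh
# Hypothetical helper: reads `crm configure show` output on stdin and prints
# every location constraint that looks like it was placed by a DRBD fence handler.
# Assumption: constraint IDs start with "drbd-fence-by-handler-", as used by
# crm-fence-peer.sh; change the pattern if your setup uses another handler.
drbd_fence_constraints() {
    grep -E '^location[[:space:]]+drbd-fence-by-handler-'
}

# Usage on a cluster node:
#   crm configure show | drbd_fence_constraints
```

If this prints a constraint after you have cut the DRBD link, the handler has run. Once the link is back and DRBD has resynchronized, the constraint should disappear again, provided an unfence handler is configured.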

Testing the cluster's behaviour in Split-Brain scenarios

To brutally force your cluster into a Split-Brain situation, cut all communication links between the nodes without taking either node completely offline. After such a step, each of your two cluster nodes (the two so-called cluster partitions) will consider itself to be the only remaining node and will refrain from (re-)starting resources until STONITH is successful. Without STONITH, both nodes would bring up all resources concurrently.