KISScluster design

See the goals page for the framework behind this design.

High level

The really hard part, as far as I can gather and understand, is the number of cluster nodes: two. With only two nodes there's no way a quorum can be reached. Split-brain, here we come; say goodbye to data consistency.

But: those two machines don't exist in an otherwise empty universe.

The machines are sitting right next to each other in a rack, connected to a single switch via their production interfaces (I'm certainly not gonna go the bond + redundant switches route, because this is a small setup and I've learned that too much complexity is the enemy of a working high-availability setup).

That switch has an IP address, and that is "always" reachable. "Always", as in: if the switch is down, there's no network to provide services to, so we can just shut down the VMs with no harm done.

This is how we solve the quorum problem.

In addition to this: KISS. Make it do what it needs to do, nothing more. It helps that DRBD tries very hard to avoid data inconsistency, which should prevent the same VM from running on both hosts. Another important thing: leave the big hammer to the humans -- that is, don't try to force the system into the state we want if a component thinks that won't work. Don't even try anything beyond drbdadm primary and drbdadm secondary. If the cluster is so inconsistent that VMs don't come up, so be it. We can monitor for that and fix it manually.

Cluster logic and protocol

We'll borrow an idea from VRRP: have a priority value (which starts out at 100), and reduce it for each thing that doesn't work.

kissclusterd sends out announcements on the storage and production networks with its priority. It also listens for the peer's announcements, and pings the switch. When a peer announcement or ping reply isn't received on an interface within a timeout window, the priority is reduced by 10 for each interface that went silent.

Peer priorities as read from the announcements are stored. A dead peer (no announcements received) is assigned a priority of 0. When a peer sends different priorities on different networks, the higher value is used, to handle situations where the peer is in transition, and to be on the safe side.
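
To make that concrete, here's a rough sketch of the bookkeeping in perl. The channel names, the timeout value, and the data layout are all made up for illustration; this is not the actual kissclusterd code:

    use strict;
    use warnings;

    my $BASE_PRIORITY = 100;
    my $PENALTY       = 10;
    my $TIMEOUT       = 5;   # seconds of silence before a channel counts as dead (assumed value)

    # last time we heard from the peer (announcements) or the switch (ping replies);
    # updated by the receive/ping loop, which is not shown here
    my %last_seen = ( storage => 0, production => 0, switch => 0 );

    sub our_priority {
        my $now  = time;
        my $prio = $BASE_PRIORITY;
        for my $channel ( keys %last_seen ) {
            $prio -= $PENALTY if $now - $last_seen{$channel} > $TIMEOUT;
        }
        return $prio;
    }

    # peer priority per network, as read from the announcements
    my %peer_prio;

    sub peer_priority {
        my @seen = grep { defined } values %peer_prio;
        return 0 unless @seen;              # dead peer: priority 0
        ( sort { $b <=> $a } @seen )[0];    # networks disagree: trust the higher value
    }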

The whole system has three systemd targets:

  • no-vms: no VMs are running
  • primary-vms: only VMs where we're primary are running
  • all-vms: all VMs, primary and backup, are running

kissclusterd then just compares the priorities and switches to the respective systemd target:

  • our == other: primary-vms
  • our < other: no-vms
  • our > other: all-vms
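
As a sketch (again illustrative; I'm assuming the switch happens via a plain systemctl start):

    sub switch_target {
        my ( $ours, $theirs ) = @_;
        my $target = $ours == $theirs ? 'primary-vms.target'
                   : $ours <  $theirs ? 'no-vms.target'
                   :                    'all-vms.target';
        system( 'systemctl', 'start', $target ) == 0
            or warn "could not start $target\n";
    }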

And the systemd units for the VMs make sure DRBD primary selection and stuff is taken care of.

Packet format

The packet is in ASCII ('cause that's easy to parse in perl), one line:

proto_version:sender:priority:timestamp:HMAC

proto_version is "1", for now.

sender is the sender's IP address and must match the address we have configured for the peer.

priority is the priority the sender has calculated for itself.

timestamp is the unix time_t of transmission (and we allow a skew of 10 seconds before we discard a received packet).

HMAC is a SHA256 HMAC over the first four fields of the packet, as sent, but excluding the trailing colon. It is encoded in Base64.
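
Putting the format together, building and checking a packet could look roughly like this. This is a sketch: Digest::SHA is an assumed choice for the HMAC (note that its Base64 output comes without trailing '=' padding), and the secret and peer handling are simplified:

    use strict;
    use warnings;
    use Digest::SHA qw(hmac_sha256_base64);

    my $MAX_SKEW = 10;   # seconds

    sub build_packet {
        my ( $sender_ip, $priority, $secret ) = @_;
        my $fields = join ':', '1', $sender_ip, $priority, time;
        # HMAC over the first four fields as sent, no trailing colon
        return $fields . ':' . hmac_sha256_base64( $fields, $secret ) . "\n";
    }

    # returns the peer's priority, or undef if the packet is bad
    sub parse_packet {
        my ( $line, $expected_sender, $secret ) = @_;
        chomp $line;
        my ( $ver, $sender, $prio, $ts, $hmac ) = split /:/, $line, 5;
        return undef unless defined $hmac;
        return undef unless $ver eq '1';
        return undef unless $sender eq $expected_sender;
        return undef if abs( time - $ts ) > $MAX_SKEW;
        return undef
            unless hmac_sha256_base64( "$ver:$sender:$prio:$ts", $secret ) eq $hmac;
        return $prio;
    }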

So with the HMAC we have packet authentication, and with the timestamp included, we have both replay protection and protection against mis-timed clusters syncing. However, that also means badly skewed clocks result in a split-brain situation (as we would have with a mismatched HMAC secret), because each node discards the other's packets and considers its peer dead. DRBD still prevents multi-primary operation, so consistency is guaranteed.

systemd integration

Note that this was a nice idea, but it has some limitations when it comes to automatically migrating the VMs away from a node that is going down. I cannot think of a race-free way to do that with systemd.

So in the end, we will need some kind of cluster resource manager that keeps track of the VMs: where they run, where they should run, and so on. But not in the first iteration.

This is where systemd with its dependency handling comes in. We define three targets, no-vms.target, primary-vms.target, and all-vms.target, and start them accordingly. They conflict with each other, so only one can be active at any time (systemd "targets" are kinda like runlevels).
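
A minimal target unit could look like this (a sketch, not the actual unit file; the Conflicts line does the mutual exclusion):

    # no-vms.target (sketch) -- primary-vms.target and all-vms.target analogous
    [Unit]
    Description=KISScluster: run no VMs
    Conflicts=primary-vms.target all-vms.target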

Then we define services for the VMs; these are wanted either by just all-vms.target, or by both all-vms.target and primary-vms.target, so they're started when the respective target is active. Note that they are not stopped automatically by the targets.
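
A per-VM unit might then look roughly like this -- the VM name "web", the qemu invocation, and the ordering details are placeholders, not the real setup:

    # vm-web.service (sketch for a VM called "web")
    [Unit]
    Description=VM web
    Requires=drbd-vm@web.service
    After=drbd-vm@web.service

    [Service]
    ExecStart=/usr/bin/qemu-system-x86_64 -name web ...   # hypothetical command line

    [Install]
    # we're primary for this VM; a backup-only VM would list
    # just all-vms.target
    WantedBy=primary-vms.target all-vms.target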

The VM services depend on an instance of drbd-vm@.service for their DRBD resources (this is an instantiated service). That one has StopWhenUnneeded set, so it stops after the VM stops, freeing the resource for the other node. This was a bit tricky to configure, as stopping a VM can take quite some time, drbdadm primary does not wait, and I couldn't get a systemd unit to retry what should be a oneshot service. So I resorted to writing a very small shell script that just retries drbdadm primary until it succeeds, and the service definition's timeout handling does the rest.
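
For illustration, the instantiated DRBD unit and the retry script could look roughly like this; the paths, the timeout value, and the %i-as-resource-name convention are my assumptions:

    # drbd-vm@.service (sketch)
    [Unit]
    Description=DRBD primary role for resource %i
    StopWhenUnneeded=yes

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    TimeoutStartSec=300
    ExecStart=/usr/local/sbin/drbd-primary-wait %i
    ExecStop=/sbin/drbdadm secondary %i

And the retry script:

    #!/bin/sh
    # drbd-primary-wait (sketch): retry "drbdadm primary" until the peer
    # has demoted the resource and the promotion succeeds; systemd's
    # TimeoutStartSec aborts us if that never happens.
    res="$1"
    until drbdadm primary "$res"; do
        sleep 2
    done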