KISScluster design
See the goals page for the framework for this design.
High level
The really hard part, from what I've gathered and what I can understand, is the number of cluster nodes: two. That means there's no way a quorum can be reached. Split-brain, here we come; say goodbye to data consistency.
But: those two machines don't exist in an otherwise empty universe.
The machines are sitting right next to each other in a rack, connected to a single switch via their production interfaces (I'm certainly not gonna go the bond + redundant switches route, because this is a small setup and I've learned that too much complexity is the enemy of a working high-availability setup).
That switch has an IP address, and that is "always" reachable. "Always", as in: if the switch is down, there's no network to be provided with services, so we can just shut down the VMs with no harm.
This is how we solve the quorum problem.
In addition to this: KISS. Make it do what it needs to do, nothing more. It helps that DRBD tries very hard to avoid data inconsistency, which should prevent the same VM from running on both hosts. Another important thing: leave the big hammer to the humans -- that is, don't try to force the system into the state we want if a component thinks that's not working. Don't even try more than drbdadm primary and drbdadm secondary. If the cluster is so inconsistent that VMs don't come up, so be it. We can monitor for that and fix it manually.
Cluster logic and protocol
We'll borrow an idea from VRRP: have a priority value (which starts out at 100), and reduce it for everything that doesn't work.
kissclusterd sends out announcements on the storage and production networks with its priority. It also listens for the peer's announcements and pings the switch. When a peer announcement or ping reply isn't received on an interface within the timeout window, the priority is reduced by 10 for each missing interface.
Peer priorities as read from the announcements are stored. A dead peer (no announcements received) is assigned a priority of 0. When a peer sends different priorities on different networks, the higher value is used, to handle situations when the peer is in transition, and to be on the safe side.
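The two calculations above can be pinned down in a few lines of shell. This is only a sketch of the arithmetic, not the actual daemon; the function names and the flag-style arguments are made up for illustration:

```shell
# calc_priority: start at 100 and subtract 10 for each check that failed
# (peer announcement or switch ping missing on an interface).
# Arguments: one flag per check, 1 = ok, 0 = missing.
calc_priority() {
    prio=100
    for ok in "$@"; do
        if [ "$ok" -ne 1 ]; then
            prio=$((prio - 10))
        fi
    done
    echo "$prio"
}

# peer_priority: the effective peer priority is the highest value heard
# on any network; a dead peer (nothing heard at all) counts as 0.
# Arguments: one value per network, '-' meaning "no announcement heard".
peer_priority() {
    best=0
    for p in "$@"; do
        if [ "$p" != '-' ] && [ "$p" -gt "$best" ]; then
            best=$p
        fi
    done
    echo "$best"
}
```

So a node that lost its storage-network peer announcement and the switch ping would announce 80, and a peer heard with 90 on one network and 100 on the other is treated as 100.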
The whole system has three systemd targets:
- no-vms: no VMs are running
- primary-vms: only VMs where we're primary are running
- all-vms: all VMs, primary and backup, are running
kissclusterd then just compares the priorities and switches to the respective systemd target:
- ours == the peer's: primary-vms
- ours < the peer's: no-vms
- ours > the peer's: all-vms
And the systemd units for the VMs make sure DRBD primary selection and stuff is taken care of.
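The target-selection rule is small enough to write out in full. A sketch in shell (the function name is made up; kissclusterd itself is not this script, this is just to pin down the comparison):

```shell
# decide_target: map our priority and the peer's effective priority
# to the systemd target that should be active on this node.
decide_target() {            # args: our_priority peer_priority
    if   [ "$1" -gt "$2" ]; then echo all-vms.target
    elif [ "$1" -lt "$2" ]; then echo no-vms.target
    else                         echo primary-vms.target
    fi
}
```

The daemon would then do something like systemctl start "$(decide_target "$ours" "$peers")"; since the targets conflict with each other, starting one stops the previous one.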
Packet format
The packet is in ASCII ('cause that's easy to parse in Perl), one line:
proto_version:sender:priority:timestamp:HMAC
proto_version is "1", for now.
sender is the sender's IP address and must match the address we have configured for the peer.
priority is the priority the sender has calculated for itself.
timestamp is the unix time_t of transmission (and we allow a skew of 10 seconds before we discard a received packet).
HMAC is a SHA256 HMAC over the first four fields of the packet, as sent, but excluding the trailing colon. It is encoded in Base64.
So with the HMAC we have packet authentication, and with the timestamp included, we have both replay protection and protection against mis-timed clusters syncing. However, that also means mismatched clocks put us in a split-brain situation (just as a mismatched HMAC secret would): each node discards the other's packets and considers the peer dead. DRBD still prevents multi-primary operation, so consistency is guaranteed.
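Building and checking such a packet can be sketched with openssl(1). The secret, the sender address, and the function names here are examples, not the real configuration:

```shell
SECRET='example-shared-secret'     # stand-in for the real shared secret

# make_packet: assemble "proto:sender:priority:timestamp:HMAC".
make_packet() {                    # args: sender_ip priority
    payload="1:$1:$2:$(date +%s)"
    mac=$(printf '%s' "$payload" |
          openssl dgst -sha256 -hmac "$SECRET" -binary | base64)
    printf '%s:%s\n' "$payload" "$mac"
}

# check_packet: return 0 if the HMAC matches and the timestamp is
# within the 10-second skew window, non-zero otherwise.
check_packet() {                   # arg: packet
    payload=${1%:*}                # first four fields, no trailing colon
    mac=${1##*:}                   # Base64 contains no ':', so this is safe
    want=$(printf '%s' "$payload" |
           openssl dgst -sha256 -hmac "$SECRET" -binary | base64)
    if [ "$mac" != "$want" ]; then return 1; fi
    ts=${payload##*:}
    skew=$(( $(date +%s) - ts ))
    if [ "$skew" -lt 0 ]; then skew=$((-skew)); fi
    [ "$skew" -le 10 ]
}
```

A freshly built packet verifies; flip one character of the HMAC (or wait past the skew window) and it is discarded.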
systemd integration
Note that this was a nice idea, but it has some limitations when it comes to automatically migrating the VMs away from a node that is going down. I cannot think of a way to do that with systemd that is free of races.
So in the end, we will need some kind of cluster resource manager that keeps track of the VMs: where they run, where they should run, and so on. But not in the first iteration.
This is where systemd with its dependency handling comes in. We define three targets, no-vms.target, primary-vms.target, and all-vms.target, and start those accordingly. They conflict with each other, so only one can be active at any time (systemd "targets" are kinda like runlevels).
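A target of this kind could look roughly like the following. Only the three target names come from the design; the rest of the unit text is an assumption:

```ini
# /etc/systemd/system/no-vms.target (sketch)
[Unit]
Description=KISScluster: no VMs running
# Starting this target stops the other two, and vice versa.
Conflicts=primary-vms.target all-vms.target

# primary-vms.target and all-vms.target mirror this unit, each
# listing the other two targets in Conflicts=.
```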
Then we define services for the VMs; these are wanted by either just all-vms.target, or by both all-vms.target and primary-vms.target, so they're started when the respective target is active. Note that they are not stopped automatically by the targets.
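For a VM we're primary for, the install section would name both targets; "alpha" here is a made-up VM name and the unit text is a sketch:

```ini
# /etc/systemd/system/vm-alpha.service (sketch)
[Unit]
Description=VM alpha

[Install]
# A VM we're primary for is wanted by both targets; a backup-only VM
# would list just all-vms.target here.
WantedBy=primary-vms.target all-vms.target
```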
The VM services depend on an instance of drbd-vm@.service for their DRBD resources (this is an instantiated service). That one has StopWhenUnneeded set, so it stops after the VM stops, freeing the resource for the other node. This was a bit tricky to configure, as stopping a VM can take quite some time, drbdadm primary does not wait, and I couldn't get a systemd unit to retry what should be a oneshot service. So I resorted to writing a very small shell script that just retries drbdadm primary until it succeeds, and the service definition's timeout handling does the rest.
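The retry script amounts to little more than a loop. A sketch (the function name and sleep interval are assumptions; the surrounding service's start timeout puts an upper bound on how long it spins):

```shell
# wait_primary: retry "drbdadm primary" until the peer has released the
# DRBD resource (i.e. its drbd-vm@ instance stopped and demoted it).
wait_primary() {                   # arg: DRBD resource name
    until drbdadm primary "$1"; do
        sleep 1
    done
}
```

The drbd-vm@.service instance would call this from ExecStart, and its TimeoutStartSec decides when to give up.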