This is outdated and for historical amusement only. See kisscluster for the start page of the project.

KISS the complicated clustering good-bye!

this is a rant. and a design document. or maybe just a testament to the fact that I'm an entitled asshole who's not bright enough to configure stuff. we'll see, as this work in progress progresses.

the problem at hand

you have a pair of servers that you want to run virtual machines on. Xen, KVM, whatever. you want load distribution and redundancy.

fine, you think, just use linux clustering stuff and DRBD and you're done.

the rant

DRBD is fine. I got it to run with an amount of effort that seemed adequate to the complexity of the problem it solves.

I had worked with heartbeat years ago, and while it did work, I hated it because it had cost me like half a day to find a problem in a config file: the fscking error messages were so misleading, and the debugging facilities were as lacking as the documentation.

so I found that the cool kids use corosync and pacemaker these days, and I started to build a config. as with many open source projects, there's a lot of in-depth documentation, but the "big picture" document is missing; with these two, it's even worse. no simple config examples to be found, just snippets that solve very specific and exotic problems. apparently I'm supposed to build a config from first principles, reinventing the wheel for the N-th time as many others before me must have done, and all of that in XML or in a "simple and intuitive" command line syntax. thanks, but no thanks.

think about this (from the DRBD docs):

crm configure
crm(live)configure# primitive drbd_mysql ocf:linbit:drbd \
                    params drbd_resource="mysql" \
                    op monitor interval="29s" role="Master" \
                    op monitor interval="31s" role="Slave"
crm(live)configure# ms ms_drbd_mysql drbd_mysql \
                    meta master-max="1" master-node-max="1" \
                         clone-max="2" clone-node-max="1" \
                         notify="true"
crm(live)configure# primitive fs_mysql ocf:heartbeat:Filesystem \
                    params device="/dev/drbd/by-res/mysql" \
                      directory="/var/lib/mysql" fstype="ext3"
crm(live)configure# primitive ip_mysql ocf:heartbeat:IPaddr2 \
                    params ip="10.9.42.1" nic="eth0"
crm(live)configure# primitive mysqld lsb:mysqld
crm(live)configure# group mysql fs_mysql ip_mysql mysqld
crm(live)configure# colocation mysql_on_drbd \
                      inf: mysql ms_drbd_mysql:Master
crm(live)configure# order mysql_after_drbd \
                      inf: ms_drbd_mysql:promote mysql:start
crm(live)configure# commit
crm(live)configure# exit
bye

and imagine you got that running at some point in the past and then did other stuff for months; and now it's broken and you've got to get your mysql server running again ASAP. simple and intuitive, my ass.

more searching then brought me to ganeti, which is what google uses to tackle the problem I have, except at web scale. looks nice and well-documented, pre-packaged for debian, what could go wrong?

well, reality, that's what can go wrong. one thing is that the lack of documentation about basic clustering stuff (i.e. detecting which machine is up, failover and so on) should have warned me that something's missing. plus, when I got the real hardware to play around with, I quickly found out that it's pretty easy to break ganeti in a way where you have two machines that both think they're master but refuse to start the master daemon (which, btw, all the docs list as ganeti-masterd, but it's actually ganeti-luxid), and I had a very hard time getting it up again. and now there's still an instance DRBD disk that's considered inconsistent and that I can't seem to fix without re-creating the instance. the fact that ganeti configures DRBD on the fly and not via the usual config files doesn't help there. oh, and I found the place in the docs where it says what they plan to do WRT basic clustering some time in the future, and how hard that is to solve correctly.

in the end, it's a piece of software that's prone to breaking catastrophically, and it may be production-ready if you have a well-trained on-call staff with the developers on speed-dial.

to me, it is way too opaque, and while it does solve quite a few of the tedious issues at hand (like setting up and managing VMs and stuff), it doesn't solve the hard problems at all.

I have two interpretations for this clusterfuck:

a) it's all very complicated due to the nature of the general problem, and having only two hosts is a degenerate case that's very hard to solve. I'm too dense to understand that my problem cannot be solved as conveniently as I would like.

b) this stuff was designed by CS graduates, who have all the theoretical background and try to solve the problems at hand in the most elegant and generic way, completely ignoring reality and practicality. plus second system syndrome in the case of corosync and pacemaker.

whatever it may be, those pieces of software certainly don't solve my problem in a way that I'm happy with.

so I was walking home from work way too late and way too frustrated, and it occurred to me that I had solved seemingly hard (or expensive) problems before by being my usual ignorant self and going...

how hard can it be?

seriously, fsck the general case, let me try and design a solution to the specific problem at hand: a two machine virtualisation cluster.

the really hard part, from what I've gathered and what I can understand, is that number: two, because it means there's no way a quorum can be reached. split-brain, here we come; say goodbye to data consistency.

but: those two machines don't exist in an otherwise empty universe.

to keep stuff simple, the network setup is as follows:

  • a storage network, which is just a piece of CAT5 between the machines that is used for DRBD traffic

  • a production network, which has the VM production traffic and whatnot.

the machines are sitting right next to each other in a rack, connected to a single switch via their production interfaces (I'm certainly not gonna go the bond + redundant switches route, because this is a small setup and I've learned that too much complexity is the enemy of a working high-availability setup).

that switch has an IP address, and that is "always" reachable. "always", as in: if the switch is down, there's no network to be provided with services, so we can just shut down the VMs with no harm.

I think I can solve the quorum problem that way, but I'm thinking this thru as I'm writing, so stay with me (or turn away in disgust, as you like).

in addition to this: KISS. make it do what it needs to do, nothing more. it helps that DRBD tries very hard to avoid data inconsistency, which should prevent the same VM from running on both hosts. another important thing: leave the big hammer to the humans -- that is, don't try to force the system into the state we want if a component thinks that's not working. don't even try more than drbdadm primary and drbdadm secondary. if the cluster is so inconsistent that VMs don't come up, so be it. we can monitor for that and fix it manually.

design, sorta

note: this is from the early stages of the project, and only left in for historical purposes. please refer to design for the current status.

we'll borrow an idea from VRRP: have a priority value, and reduce it for each thing that doesn't work.

kissclusterd sends out announcements on the storage and production networks with its priority. it also listens for the peer's announcements, and pings the switch. when a peer announcement or ping reply isn't received on an interface within a timeout window, the priority is reduced by 10 for each missing interface.

peer priorities as read from the announcements are stored. a dead peer (no announcements received) gets a priority of 0. when a peer sends different priorities on different networks, the higher one is used, to handle situations where the peer is in transition, and to be on the safe side.
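as a sketch, the calculation could look something like this in perl (the base priority of 100, the 5 second timeout and all the names are my assumptions; only the "reduce by 10" part is from above):

# sketch: calculate our own priority and the peer's effective priority.
# BASE_PRIO, TIMEOUT and the hash layout are assumptions, not real config.
use constant { BASE_PRIO => 100, TIMEOUT => 5, PENALTY => 10 };

sub my_priority {
    my (%last_seen) = @_;    # last time we heard something, per check:
                             # 'storage' peer, 'production' peer, 'switch' ping
    my $prio = BASE_PRIO;
    for my $check (qw(storage production switch)) {
        # reduce by 10 for everything we haven't heard from within the window
        $prio -= PENALTY
            if !defined $last_seen{$check}
            or time() - $last_seen{$check} > TIMEOUT;
    }
    return $prio;
}

sub peer_priority {
    my ($prio_storage, $prio_production) = @_;    # undef = nothing received
    return 0 unless defined $prio_storage or defined $prio_production;  # dead peer
    # if the peer says different things on different nets, believe the higher one
    return (sort { $b <=> $a } grep { defined } $prio_storage, $prio_production)[0];
}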

the whole system has three systemd targets:

  • no-vms: no VMs are running
  • primary-vms: only VMs where we're primary are running
  • all-vms: all VMs, primary and backup, are running

kissclusterd then just compares the priorities and switches to the respective systemd target:

  • our == other: primary-vms
  • our < other: no-vms
  • our > other: all-vms

and the systemd units for the VMs make sure DRBD primary selection and stuff is taken care of.
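in code, that decision is tiny; here's a sketch (target names from above, the systemctl invocation is my assumption of how kissclusterd would do it):

# sketch: map the priority comparison onto a systemd target and activate it
sub wanted_target {
    my ($our, $other) = @_;
    return 'primary-vms.target' if $our == $other;
    return 'no-vms.target'      if $our < $other;
    return 'all-vms.target';                        # our > other
}

sub switch_target {
    my ($target) = @_;
    # the targets conflict with each other, so starting one stops the others
    system('systemctl', 'start', $target) == 0
        or warn "failed to start $target\n";
}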

packet format

the packet is in ASCII ('cause that's easy to parse in perl), one line:

proto_version:sender:priority:timestamp:HMAC

proto_version is "1", for now.

sender is the sender's IP address and must match the address we have configured for the peer.

priority is the priority the sender has calculated for itself.

timestamp is the unix time_t of transmission (and we allow a skew of 10 seconds before we discard a received packet).

HMAC is a SHA256 HMAC over the first four fields of the packet, as sent, but excluding the trailing colon. it is encoded in Base64.

so with the HMAC we have packet authentication, and with the timestamp included, we have both replay protection and protection against clusters with out-of-sync clocks talking to each other. however, that also means mis-timed clocks result in a split-brain situation (just as a mismatched HMAC secret would). DRBD still prevents multi-primary operation, so consistency is guaranteed.
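to make that concrete, here's a sketch of building and checking such a packet in perl (the shared secret, the peer address check and the 10 second skew are from above; the function names and everything else are just for illustration):

use strict;
use warnings;
use Digest::SHA qw(hmac_sha256);
use MIME::Base64 qw(encode_base64);

my $secret = 'shared-hmac-secret';    # hypothetical shared secret, same on both nodes

# build a version 1 packet: proto_version:sender:priority:timestamp:HMAC
sub build_packet {
    my ($sender, $priority) = @_;
    my $body = join ':', 1, $sender, $priority, time();
    my $hmac = encode_base64(hmac_sha256($body, $secret), '');
    return "$body:$hmac\n";
}

# returns the peer's priority, or undef if the packet doesn't check out
sub check_packet {
    my ($line, $peer_ip) = @_;
    chomp $line;
    my ($ver, $sender, $prio, $ts, $hmac) = split /:/, $line;
    return undef unless defined $hmac and $ver eq '1';
    return undef unless $sender eq $peer_ip;          # must match the configured peer
    return undef unless abs(time() - $ts) <= 10;      # allowed clock skew
    my $body = join ':', $ver, $sender, $prio, $ts;   # first four fields, no trailing colon
    return undef unless $hmac eq encode_base64(hmac_sha256($body, $secret), '');
    return $prio;
}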

systemd integration

this is where systemd with its dependency handling comes in. we define three targets, no-vms, primary-vms, all-vms, and start them accordingly. they conflict with each other, so only one can be active at any time.
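for illustration, one of these targets could look like this (a sketch; I'm assuming plain Conflicts= lines are enough for the mutual exclusion):

# no-vms.target (sketch)
[Unit]
Description=run no VMs
Conflicts=primary-vms.target all-vms.target

# primary-vms.target and all-vms.target look the same,
# each conflicting with the other two targets.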

then we define services for the VMs; these are wanted by either just all-vms.target, or all-vms.target and primary-vms.target, and have set StopWhenUnneeded=true, which stops them when the current target does not want them (they're "wanted" and not "required" because that allows you to manually stop them without the target stopping with them).
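a sketch of what such a VM service could look like (the unit name and the start/stop helpers are placeholders; the real ExecStart would be the long qemu command line):

# vm-mysql.service (sketch)
[Unit]
Description=VM mysql
StopWhenUnneeded=true

[Service]
# placeholders for the actual qemu invocation and a clean guest shutdown
ExecStart=/usr/local/sbin/start-vm mysql
ExecStop=/usr/local/sbin/stop-vm mysql
TimeoutStopSec=300

[Install]
WantedBy=primary-vms.target all-vms.target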

and the VM services depend on an instance of drbd-vm@.service for their DRBD resources. that one also has StopWhenUnneeded set, so it stops after the VM stops, freeing the resource for the other node. this was a bit tricky to configure, as stopping a VM can take quite some time, drbdadm primary does not wait, and I couldn't get a systemd unit to retry what should be a oneshot service. so I resorted to writing a very small shell script that just retries drbdadm primary until it succeeds, and the service definition's timeout handling does the rest.
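the retry script and the template unit could be about as small as this (a sketch; the file names, Type=oneshot and the timeouts are my guesses at the wiring):

#!/bin/sh
# drbd-wait-primary (sketch): retry promotion until the peer has released
# the resource; the unit's start timeout puts a limit on how long we try.
until drbdadm primary "$1"; do
    sleep 2
done

# drbd-vm@.service (sketch)
[Unit]
Description=DRBD primary for resource %i
StopWhenUnneeded=true

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/usr/local/sbin/drbd-wait-primary %i
ExecStop=/usr/sbin/drbdadm secondary %i
TimeoutStartSec=600

the VM service from above then gets Requires= and After= on its drbd-vm@ instance, so the resource is promoted before the VM starts and released once it has stopped.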

what's missing?

there's a bunch of management scripts that's got to be written:

  • show system/cluster status overview

  • create VM configurations and deploy them to the cluster

  • create logical volumes and DRBD configs for VMs

  • install VMs

  • a service file for kissclusterd

  • monitoring integration; either check_mk or nagios

more ranting

calling qemu

qemu is... like a thick jungle. that seems to be why most people use libvirt. if you try to find out how to run it directly, you have a man page with all the command line options (at least I hope it's complete), and you start to think "hey, wouldn't a config file be a nice thing for a program that's regularly called with a 300+ character command line?" and you find -readconfig. and then you start to search for documentation for the config file, but there is none. and then you add -writeconfig foo to your existing command line and find that it only writes about 80% of the options into the file. and you still don't find docs on the file format, and you start to look at the source.

and then you give up and go back to command lines, like everybody else does.

libvirt

I'm just not going there. another intermediary between my primary config data and the machine. with XML. nope. I don't need the flexibility to run virtualisation platform A today and B tomorrow, and if I ever switch, I'll write some new templates and be done. one less tool stack to learn.

first design (historical)

so... what a machine really needs to know is a status:

  • OK: run the assigned VMs

  • standalone: run all VMs

  • passive: run no VMs

  • (dead): this is what the other machine thinks when it doesn't see it.

the first three states are also what the machines advertise in their heartbeat messages.

(deciding what to do then, that is, reconfiguring DRBD, starting/stopping VMs and stuff, is trivial).

on each machine, there's a daemon running that's sending out heartbeats (or pings to the switch). the decision table is like this:

peer (storage)              peer (production)  switch  result
standalone (on either net)                     *       passive
OK                          OK                 OK      OK
passive/dead                passive/dead       OK      standalone
dead                        OK                 OK      passive, or OK (1)
OK                          OK                 dead    OK, or don't change? (2)
OK                          dead               dead    passive (3)
OK                          dead               OK      OK, or don't change? (4)
any other combination                                  passive (5)

(1) the storage net is down: go to passive (as VMs can no longer be DRBD-mirrored), or maybe keep going in a DRBD-standalone "degraded" mode with loss of redundancy?

(2) must mean switch is misconfigured/crashed.

(3) our production link is down, or switch is down

(4) other node's production link is down. it will detect that and become passive, and then we will take over.

(5) this is the famous "can't happen" case, which means we better err on the safe side.
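just to nail the table down, a sketch of the same thing as a perl function (where the table lists two options, I've simply picked the first one):

sub decide {
    my ($peer_storage, $peer_prod, $switch) = @_;   # each one of OK/passive/dead/standalone
    return 'passive'    if $peer_storage eq 'standalone' or $peer_prod eq 'standalone';
    return 'OK'         if $peer_storage eq 'OK' and $peer_prod eq 'OK' and $switch eq 'OK';
    return 'standalone' if $peer_storage =~ /^(passive|dead)$/
                       and $peer_prod    =~ /^(passive|dead)$/
                       and $switch eq 'OK';
    return 'passive'    if $peer_storage eq 'dead' and $peer_prod eq 'OK'   and $switch eq 'OK';   # (1)
    return 'OK'         if $peer_storage eq 'OK'   and $peer_prod eq 'OK'   and $switch eq 'dead'; # (2)
    return 'passive'    if $peer_storage eq 'OK'   and $peer_prod eq 'dead' and $switch eq 'dead'; # (3)
    return 'OK'         if $peer_storage eq 'OK'   and $peer_prod eq 'dead' and $switch eq 'OK';   # (4)
    return 'passive';                                                                              # (5)
}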

state transitions

OK, that's it for the steady state, but how do we go about transitioning between states?

I think we need some more states, because as it is, a standalone node cannot decide whether its partner is passive because it is just starting up, shutting down, administratively down, or whatever. so how does the cluster transition?

let's use multiple states, one for "how we see the cluster state" and one per VM as an operating state.

cluster state: this describes how the machine sees its role in the cluster.

  • (dead): this is what the other machine thinks if it's not seeing us

  • unavailable: not available for running VMs, whether it is while booting up, shutting down, administratively down, or some local problem

  • available: everything checks out OK, but we're not running VMs

  • running: everything checks out OK and we're running VMs

does that work? what happens when a machine goes down in a controlled way?

  • state is changed to "unavailable"

  • the VMs are shut down gracefully

  • the DRBD devices are released

that means the remaining machine can prepare to take over the VMs:

  • change internal state to "standalone"

  • wait for the DRBD devices to become available, switch them to primary

  • start the VMs

to be continued...